pith. sign in

arxiv: 2605.22036 · v1 · pith:CFUQQNK3new · submitted 2026-05-21 · 💻 cs.CV · cs.AI

GA-VLN: Geometry-Aware BEV Representation for Efficient Vision-Language Navigation

Pith reviewed 2026-05-22 06:41 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords Vision-Language NavigationBird's-Eye-View RepresentationRGB-D Projection3D Geometric PriorsMultimodal Large Language ModelsToken EfficiencySpatial Reasoning
0
0 comments X

The pith

Projecting RGB-D features into agent-centric BEV maps plus 3D priors lets vision-language navigation models reach state-of-the-art performance using only navigation data.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that dense RGB video inputs create too many tokens and hide spatial structure in vision-language navigation. It replaces them with a compact bird's-eye-view layout built by lifting RGB-D features into 3D and collapsing them around the agent. A pretrained 3D foundation model adds structural priors to the same layout. The resulting representation keeps geometric relationships while cutting token count, so the model can reason about space more efficiently and still hit top accuracy without data-augmentation tricks or extra question-answering data.

Core claim

By projecting visual features from RGB-D inputs into 3D space and aggregating them into an agent-centric BEV layout, then enriching that layout with features from a pretrained 3D foundation model, the GA-BEV representation supplies both explicit depth geometry and implicit structural priors inside a compact token set that multimodal language models can use directly for navigation.

What carries the argument

Geometry-Aware BEV (GA-BEV) representation: the explicit 3D projection of RGB-D patches followed by agent-centric aggregation, augmented by implicit priors from a 3D foundation model.

If this is right

  • Navigation agents can process shorter token sequences and therefore run with lower compute and memory cost per step.
  • Explicit depth projection combined with 3D priors improves spatial reasoning accuracy without requiring mixed VQA training.
  • Models trained only on navigation trajectories reach state-of-the-art success rates, showing that extra augmentation data is not necessary.
  • The same BEV construction can be swapped into other MLLM-based embodied tasks that need compact spatial context.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The token reduction may allow real-time inference on robots with limited onboard compute.
  • The method could be tested on longer-horizon tasks where spatial consistency matters more than single-step accuracy.
  • Replacing the 3D foundation model with a smaller or domain-specific encoder might further improve efficiency while keeping most gains.

Load-bearing premise

Projecting RGB-D features into 3D and flattening them into an agent-centric BEV map keeps enough geometric detail for correct navigation decisions.

What would settle it

Run the same navigation model on a standard VLN benchmark once with the proposed BEV maps and once with the original dense RGB patches, using identical training data and no DAgger; success rate and token count would decide whether the geometric compression preserves or loses necessary spatial information.

Figures

Figures reproduced from arXiv: 2605.22036 by Jiahao Yang, Shuqiang Jiang, Xiangyang Li, Xing Zhu, Yinghao Xu, Yujun Shen, Zihan Wang.

Figure 1
Figure 1. Figure 1: Illustration of different representations for VLN. (A) Dense image-based representations contain heavy token redun￾dancy and lack explicit spatial structure. (B) Our Geometry-Aware BEV (GA-BEV) representation combines explicit depth-projected features with implicit geometry priors from 3D foundation models, producing a highly compact yet spatially expressive representation tailored for VLN. in visual conte… view at source ↗
Figure 2
Figure 2. Figure 2: Overview of the proposed Geometry-Aware Vision-Language Navigation (GA-VLN) framework. Given RGB-D current and historical front views, our method constructs a Geometry-Aware BEV (GA-BEV) representation by combining explicit depth-guided projections with implicit geometry priors from a pretrained 3D foundation model. The projected features are aggregated into BEV grid cells to form compact and spatially exp… view at source ↗
Figure 3
Figure 3. Figure 3: An example of the GA-VLN real-world result. [PITH_FULL_IMAGE:figures/full_fig_p008_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Real-world VLN results of GA-VLN: Example #S1. [PITH_FULL_IMAGE:figures/full_fig_p013_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Real-world VLN results of GA-VLN: Example #S2. [PITH_FULL_IMAGE:figures/full_fig_p013_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Comparison of token usage across navigation steps for different configurations. The number shows in each legend corresponds [PITH_FULL_IMAGE:figures/full_fig_p015_6.png] view at source ↗
read the original abstract

Despite significant progress in Vision-Language Navigation (VLN), existing approaches still rely on dense RGB videos that produce excessive patch tokens and lack explicit spatial structure, resulting in substantial computational overhead and limited spatial reasoning. To address these issues, we introduce the Geometry-Aware BEV (GA-BEV) - a compact, 3D-grounded feature representation that integrates both explicit and implicit geometric cues into multimodal large language model (MLLM) - based navigation systems. We construct BEV spatial maps from RGB-D inputs by projecting visual features into 3D space and aggregating them into an agent-centric layout that preserves geometric consistency while reducing token redundancy. To further enrich geometric understanding, we incorporate features from a pretrained 3D foundation model into the BEV space, injecting structural priors learned from large-scale 3D reconstruction tasks. Together, these complementary cues - explicit depth-based projection and implicit learned priors - yield compact yet spatially expressive representations that substantially improve navigation efficiency and performance. Experiments show that our method achieves state-of-the-art results using only navigation data, without DAgger augmentation or mixed VQA training, demonstrating the robustness and data efficiency of the proposed GA-VLN framework.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 3 minor

Summary. The manuscript proposes GA-VLN, a framework for Vision-Language Navigation that introduces a Geometry-Aware BEV (GA-BEV) representation. This is constructed by projecting visual features from RGB-D inputs into 3D space and aggregating them into an agent-centric layout to preserve geometric consistency while reducing token redundancy; pretrained 3D foundation model features are then injected to supply structural priors. The approach is integrated into MLLM-based navigation and claims state-of-the-art performance using only navigation data, without DAgger augmentation or mixed VQA training.

Significance. If the reported results hold, the work offers a concrete route to more efficient VLN by replacing dense RGB token streams with a compact, explicitly geometric BEV map augmented by 3D priors. The demonstrated data efficiency—achieving strong performance from navigation data alone—is a clear strength that could reduce reliance on expensive augmentation pipelines. The explicit separation of depth-based projection and learned 3D priors also provides a useful ablation axis for future geometric VLN research.

minor comments (3)
  1. [§4] §4 (Experiments): the main results table should report absolute success rate, SPL, and navigation error for all baselines and ablations in a single, easily comparable format rather than scattering key numbers across text and supplementary material.
  2. [§3.2] §3.2 (BEV construction): the aggregation step that maps projected 3D points to the agent-centric grid is described only at a high level; adding the explicit binning or interpolation formula would improve reproducibility.
  3. [Figure 3] Figure 3 caption: the visualization of GA-BEV features would be clearer if it explicitly labeled which channels correspond to the depth-projected RGB features versus the injected 3D-foundation-model features.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive review and recommendation for minor revision. We appreciate the recognition of the data efficiency and geometric contributions of GA-VLN.

Circularity Check

0 steps flagged

No significant circularity; derivation is self-contained

full rationale

The paper constructs GA-BEV via explicit depth-based projection of RGB-D features into 3D space followed by agent-centric aggregation, plus injection of features from an external pretrained 3D foundation model. These steps rely on standard geometric projection pipelines and off-the-shelf priors rather than any self-referential fitting, parameter estimation from the target navigation outputs, or load-bearing self-citations. Ablations isolate each cue's contribution and demonstrate gains on navigation data alone without DAgger or VQA mixing. No equation reduces a claimed prediction to its own inputs by construction, and the central efficiency and SOTA claims rest on independent experimental controls.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entities

Abstract-only review limits visibility into exact parameters or background lemmas; listed items are inferred from stated construction steps.

axioms (2)
  • domain assumption RGB-D inputs supply reliable depth values that can be projected into consistent 3D space without significant sensor noise or calibration error.
    Invoked when constructing BEV spatial maps from RGB-D inputs by projecting visual features into 3D space.
  • domain assumption Features from a pretrained 3D foundation model transfer useful structural priors to the navigation BEV space without domain-specific fine-tuning.
    Invoked when incorporating features from a pretrained 3D foundation model into the BEV space.
invented entities (1)
  • GA-BEV representation no independent evidence
    purpose: Compact agent-centric layout that integrates explicit depth cues and implicit 3D priors for MLLM navigation.
    Newly proposed construct that reduces token redundancy while preserving geometric consistency.

pith-pipeline@v0.9.0 · 5755 in / 1430 out tokens · 35353 ms · 2026-05-22T06:41:43.156168+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

51 extracted references · 51 canonical work pages · 8 internal anchors

  1. [1]

    Bevbert: Multimodal map pre-training for language-guided navigation.arXiv preprint arXiv:2212.04385, 2022

    Dong An, Yuankai Qi, Yangguang Li, Yan Huang, Liang Wang, Tieniu Tan, and Jing Shao. Bevbert: Multimodal map pre-training for language-guided navigation.arXiv preprint arXiv:2212.04385, 2022. 2

  2. [2]

    Etpnav: Evolving topo- logical planning for vision-language navigation in continu- ous environments.IEEE Transactions on Pattern Analysis and Machine Intelligence, 2024

    Dong An, Hanqing Wang, Wenguan Wang, Zun Wang, Yan Huang, Keji He, and Liang Wang. Etpnav: Evolving topo- logical planning for vision-language navigation in continu- ous environments.IEEE Transactions on Pattern Analysis and Machine Intelligence, 2024. 2, 5

  3. [3]

    Vision-and-language navigation: In- terpreting visually-grounded navigation instructions in real environments

    Peter Anderson, Qi Wu, Damien Teney, Jake Bruce, Mark Johnson, Niko S ¨underhauf, Ian Reid, Stephen Gould, and Anton Van Den Hengel. Vision-and-language navigation: In- terpreting visually-grounded navigation instructions in real environments. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 3674–3683,

  4. [4]

    Scanqa: 3d question answering for spatial scene understanding

    Daichi Azuma, Taiki Miyanishi, Shuhei Kurita, and Motoaki Kawanabe. Scanqa: 3d question answering for spatial scene understanding. Inproceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 19129– 19139, 2022. 2, 5

  5. [5]

    Matterport3D: Learning from RGB-D Data in Indoor Environments

    Angel Chang, Angela Dai, Thomas Funkhouser, Maciej Halber, Matthias Niessner, Manolis Savva, Shuran Song, Andy Zeng, and Yinda Zhang. Matterport3d: Learning from rgb-d data in indoor environments.arXiv preprint arXiv:1709.06158, 2017. 5

  6. [6]

    Affordances-oriented planning using foundation models for continuous vision- language navigation

    Jiaqi Chen, Bingqian Lin, Xinmin Liu, Lin Ma, Xiao- dan Liang, and Kwan-Yee K Wong. Affordances-oriented planning using foundation models for continuous vision- language navigation. InProceedings of the AAAI Conference on Artificial Intelligence, pages 23568–23576, 2025. 6

  7. [7]

    Constraint-aware zero-shot vision-language navigation in continuous environ- ments.IEEE Transactions on Pattern Analysis and Machine Intelligence, 2025

    Kehan Chen, Dong An, Yan Huang, Rongtao Xu, Yifei Su, Yonggen Ling, Ian Reid, and Liang Wang. Constraint-aware zero-shot vision-language navigation in continuous environ- ments.IEEE Transactions on Pattern Analysis and Machine Intelligence, 2025. 6

  8. [8]

    Weakly- supervised multi-granularity map learning for vision-and- language navigation.Advances in Neural Information Pro- cessing Systems, 35:38149–38161, 2022

    Peihao Chen, Dongyu Ji, Kunyang Lin, Runhao Zeng, Thomas Li, Mingkui Tan, and Chuang Gan. Weakly- supervised multi-granularity map learning for vision-and- language navigation.Advances in Neural Information Pro- cessing Systems, 35:38149–38161, 2022. 6

  9. [9]

    Think global, act lo- cal: Dual-scale graph transformer for vision-and-language navigation

    Shizhe Chen, Pierre-Louis Guhur, Makarand Tapaswi, Cordelia Schmid, and Ivan Laptev. Think global, act lo- cal: Dual-scale graph transformer for vision-and-language navigation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16537– 16547, 2022. 2, 5

  10. [10]

    NaVILA: Legged Robot Vision-Language-Action Model for Naviga- tion

    An-Chieh Cheng, Yandong Ji, Zhaojing Yang, Zaitian Gongye, Xueyan Zou, Jan Kautz, Erdem Bıyık, Hongxu Yin, Sifei Liu, and Xiaolong Wang. Navila: Legged robot vision-language-action model for navigation.arXiv preprint arXiv:2412.04453, 2024. 1, 3, 4, 5, 6

  11. [11]

    InternNav: InternRobotics’ open platform for building generalized navigation foundation models.https://github.com/InternRobotics/ InternNav, 2025

    InternNav Contributors. InternNav: InternRobotics’ open platform for building generalized navigation foundation models.https://github.com/InternRobotics/ InternNav, 2025. 6

  12. [12]

    Cross-modal map learning for vision and language navigation

    Georgios Georgakis, Karl Schmeckpeper, Karan Wanchoo, Soham Dan, Eleni Miltsakaki, Dan Roth, and Kostas Dani- ilidis. Cross-modal map learning for vision and language navigation. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 15460– 15470, 2022. 6

  13. [13]

    Fine-grained alignment supervision matters in vision-and- language navigation.IEEE Transactions on Pattern Analysis and Machine Intelligence, 2026

    Keji He, Yan Huang, Ya Jing, Qi Wu, and Liang Wang. Fine-grained alignment supervision matters in vision-and- language navigation.IEEE Transactions on Pattern Analysis and Machine Intelligence, 2026. 2

  14. [14]

    3d gaussian splatting for real-time radiance field rendering.ACM Transactions on Graphics (TOG), 42(4):1–14, 2023

    Bernhard Kerbl, Georgios Kopanas, Thomas Leimkuehler, and George Drettakis. 3d gaussian splatting for real-time radiance field rendering.ACM Transactions on Graphics (TOG), 42(4):1–14, 2023. 2

  15. [15]

    Beyond the nav-graph: Vision-and-language navigation in continuous environments

    Jacob Krantz, Erik Wijmans, Arjun Majumdar, Dhruv Batra, and Stefan Lee. Beyond the nav-graph: Vision-and-language navigation in continuous environments. InEuropean Confer- ence on Computer Vision, pages 104–120. Springer, 2020. 1, 2, 5

  16. [16]

    Room-across-room: Multilingual vision- and-language navigation with dense spatiotemporal ground- ing.arXiv preprint arXiv:2010.07954, 2020

    Alexander Ku, Peter Anderson, Roma Patel, Eugene Ie, and Jason Baldridge. Room-across-room: Multilingual vision- and-language navigation with dense spatiotemporal ground- ing.arXiv preprint arXiv:2010.07954, 2020. 1, 2, 5

  17. [17]

    LLaVA-OneVision: Easy Visual Task Transfer

    Bo Li, Yuanhan Zhang, Dong Guo, Renrui Zhang, Feng Li, Hao Zhang, Kaichen Zhang, Peiyuan Zhang, Yanwei Li, Zi- wei Liu, et al. Llava-onevision: Easy visual task transfer. arXiv preprint arXiv:2408.03326, 2024. 1

  18. [18]

    Bird’s-eye-view scene graph for vision-language navigation

    Rui Liu, Xiaohan Wang, Wenguan Wang, and Yi Yang. Bird’s-eye-view scene graph for vision-language navigation. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 10968–10980, 2023. 2

  19. [19]

    Instructnav: Zero-shot system for generic instruction navigation in unexplored environment.arXiv preprint arXiv:2406.04882, 2024

    Yuxing Long, Wenzhe Cai, Hongcheng Wang, Guanqi Zhan, and Hao Dong. Instructnav: Zero-shot system for generic instruction navigation in unexplored environment.arXiv preprint arXiv:2406.04882, 2024. 6

  20. [20]

    monovln: Bridging the observation gap between monocular and panoramic vision and language navigation

    Renjie Lu, Yu Zhou, Hao Cheng, Jingke Meng, and Wei- Shi Zheng. monovln: Bridging the observation gap between monocular and panoramic vision and language navigation. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 9477–9486, 2025. 2

  21. [21]

    Nerf: Representing scenes as neural radiance fields for view syn- thesis.Communications of the ACM, 65(1):99–106, 2021

    Ben Mildenhall, Pratul P Srinivasan, Matthew Tancik, Jonathan T Barron, Ravi Ramamoorthi, and Ren Ng. Nerf: Representing scenes as neural radiance fields for view syn- thesis.Communications of the ACM, 65(1):99–106, 2021. 2

  22. [22]

    Reverie: Remote embodied visual referring ex- pression in real indoor environments

    Yuankai Qi, Qi Wu, Peter Anderson, Xin Wang, William Yang Wang, Chunhua Shen, and Anton van den Hengel. Reverie: Remote embodied visual referring ex- pression in real indoor environments. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9982–9991, 2020. 1, 2

  23. [23]

    Habitat-Matterport 3D Dataset (HM3D): 1000 Large-scale 3D Environments for Embodied AI

    Santhosh K Ramakrishnan, Aaron Gokaslan, Erik Wijmans, Oleksandr Maksymets, Alex Clegg, John Turner, Eric Un- dersander, Wojciech Galuba, Andrew Westbury, Angel X Chang, et al. Habitat-matterport 3d dataset (hm3d): 1000 large-scale 3d environments for embodied ai.arXiv preprint arXiv:2109.08238, 2021. 5

  24. [24]

    Language-aligned waypoint (law) super- vision for vision-and-language navigation in continuous en- vironments

    Sonia Raychaudhuri, Saim Wani, Shivansh Patel, Unnat Jain, and Angel Chang. Language-aligned waypoint (law) super- vision for vision-and-language navigation in continuous en- vironments. InProceedings of the 2021 conference on em- pirical methods in natural language processing, pages 4018– 4028, 2021. 6

  25. [25]

    Habitat: A platform for embodied ai research

    Manolis Savva, Abhishek Kadian, Oleksandr Maksymets, Yili Zhao, Erik Wijmans, Bhavana Jain, Julian Straub, Jia Liu, Vladlen Koltun, Jitendra Malik, et al. Habitat: A platform for embodied ai research. InProceedings of the IEEE/CVF international conference on computer vision, pages 9339–9347, 2019. 5

  26. [26]

    Learning to Navigate Unseen Environments: Back Translation with Environmental Dropout

    Hao Tan, Licheng Yu, and Mohit Bansal. Learning to nav- igate unseen environments: Back translation with environ- mental dropout.arXiv preprint arXiv:1904.04195, 2019. 5

  27. [27]

    Vggt: Vi- sual geometry grounded transformer

    Jianyuan Wang, Minghao Chen, Nikita Karaev, Andrea Vedaldi, Christian Rupprecht, and David Novotny. Vggt: Vi- sual geometry grounded transformer. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 5294–5306, 2025. 4, 5

  28. [28]

    Aux-think: Exploring reason- ing strategies for data-efficient vision-language navigation

    Shuo Wang, Yongcai Wang, Wanting Li, Xudong Cai, Yucheng Wang, Maiyue Chen, Kaihui Wang, Zhizhong Su, Deying Li, and Zhaoxin Fan. Aux-think: Exploring reason- ing strategies for data-efficient vision-language navigation. arXiv preprint arXiv:2505.11886, 2025. 1, 6

  29. [29]

    Monodream: Monocular vision-language navigation with panoramic dreaming.arXiv preprint arXiv:2508.02549, 2025

    Shuo Wang, Yongcai Wang, Wanting Li, Yucheng Wang, Maiyue Chen, Kaihui Wang, Zhizhong Su, Xudong Cai, Yeying Jin, Deying Li, et al. Monodream: Monocular vision-language navigation with panoramic dreaming.arXiv preprint arXiv:2508.02549, 2025. 1, 6

  30. [30]

    Dreamnav: A trajectory-based imaginative frame- work for zero-shot vision-and-language navigation.arXiv preprint arXiv:2509.11197, 2025

    Yunheng Wang, Yuetong Fang, Taowen Wang, Yixiao Feng, Yawen Tan, Shuning Zhang, Peiran Liu, Yiding Ji, and Ren- jing Xu. Dreamnav: A trajectory-based imaginative frame- work for zero-shot vision-and-language navigation.arXiv preprint arXiv:2509.11197, 2025. 6

  31. [31]

    g3d-lf: Generalizable 3d- language feature fields for embodied tasks

    Zihan Wang and Gim Hee Lee. g3d-lf: Generalizable 3d- language feature fields for embodied tasks. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 14191–14202, 2025. 2, 5, 6

  32. [32]

    Scaling data generation in vision-and-language navigation

    Zun Wang, Jialu Li, Yicong Hong, Yi Wang, Qi Wu, Mohit Bansal, Stephen Gould, Hao Tan, and Yu Qiao. Scaling data generation in vision-and-language navigation. InProceed- ings of the IEEE/CVF international conference on computer vision, pages 12009–12020, 2023. 5

  33. [33]

    Gridmm: Grid memory map for vision- and-language navigation

    Zihan Wang, Xiangyang Li, Jiahao Yang, Yeqi Liu, and Shuqiang Jiang. Gridmm: Grid memory map for vision- and-language navigation. InProceedings of the IEEE/CVF International conference on computer vision, pages 15625– 15636, 2023. 2

  34. [34]

    Bootstrapping language-guided navigation learning with self-refining data flywheel.arXiv preprint arXiv:2412.08467, 2024

    Zun Wang, Jialu Li, Yicong Hong, Songze Li, Kunchang Li, Shoubin Yu, Yi Wang, Yu Qiao, Yali Wang, Mohit Bansal, et al. Bootstrapping language-guided navigation learning with self-refining data flywheel.arXiv preprint arXiv:2412.08467, 2024. 5

  35. [35]

    Lookahead exploration with neural radiance representation for continuous vision- language navigation

    Zihan Wang, Xiangyang Li, Jiahao Yang, Yeqi Liu, Junjie Hu, Ming Jiang, and Shuqiang Jiang. Lookahead exploration with neural radiance representation for continuous vision- language navigation. InProceedings of the IEEE/CVF con- ference on computer vision and pattern recognition, pages 13753–13762, 2024. 2

  36. [36]

    Sim-to-real transfer via 3d feature fields for vision-and-language navigation.arXiv preprint arXiv:2406.09798, 2024

    Zihan Wang, Xiangyang Li, Jiahao Yang, Yeqi Liu, and Shuqiang Jiang. Sim-to-real transfer via 3d feature fields for vision-and-language navigation.arXiv preprint arXiv:2406.09798, 2024. 5, 6

  37. [37]

    Dynam3d: Dynamic layered 3d tokens empower vlm for vision-and- language navigation.arXiv preprint arXiv:2505.11383,

    Zihan Wang, Seungjun Lee, and Gim Hee Lee. Dynam3d: Dynamic layered 3d tokens empower vlm for vision-and- language navigation.arXiv preprint arXiv:2505.11383,

  38. [38]

    Navrag: Generating user demand instructions for embodied navigation through retrieval-augmented llm.arXiv preprint arXiv:2502.11142, 2025

    Zihan Wang, Yaohui Zhu, Gim Hee Lee, and Yachun Fan. Navrag: Generating user demand instructions for embodied navigation through retrieval-augmented llm.arXiv preprint arXiv:2502.11142, 2025. 1, 2, 5

  39. [39]

    Streamvln: Streaming vision-and- language navigation via slowfast context modeling.arXiv preprint arXiv:2507.05240, 2025

    Meng Wei, Chenyang Wan, Xiqian Yu, Tai Wang, Yuqiang Yang, Xiaohan Mao, Chenming Zhu, Wenzhe Cai, Hanqing Wang, Yilun Chen, et al. Streamvln: Streaming vision-and- language navigation via slowfast context modeling.arXiv preprint arXiv:2507.05240, 2025. 1, 2, 3, 4, 5, 6

  40. [40]

    Sigmoid loss for language image pre-training

    Xiaohua Zhai, Basil Mustafa, Alexander Kolesnikov, and Lucas Beyer. Sigmoid loss for language image pre-training. InProceedings of the IEEE/CVF international conference on computer vision, pages 11975–11986, 2023. 4, 5

  41. [41]

    Uni-NaVid: A Video-based Vision-Language-Action Model for Unifying Embodied Navigation Tasks

    Jiazhao Zhang, Kunyu Wang, Shaoan Wang, Minghan Li, Haoran Liu, Songlin Wei, Zhongyuan Wang, Zhizheng Zhang, and He Wang. Uni-navid: A video-based vision- language-action model for unifying embodied navigation tasks.arXiv preprint arXiv:2412.06224, 2024. 1, 3, 4, 5, 6

  42. [42]

    NaVid: Video-based VLM Plans the Next Step for Vision-and-Language Navigation

    Jiazhao Zhang, Kunyu Wang, Rongtao Xu, Gengze Zhou, Yicong Hong, Xiaomeng Fang, Qi Wu, Zhizheng Zhang, and He Wang. Navid: Video-based vlm plans the next step for vision-and-language navigation.arXiv preprint arXiv:2402.15852, 2024. 3, 4, 5, 6

  43. [43]

    Embodied navigation foundation model

    Jiazhao Zhang, Anqi Li, Yunpeng Qi, Minghan Li, Jiahang Liu, Shaoan Wang, Haoran Liu, Gengze Zhou, Yuze Wu, Xingxing Li, et al. Embodied navigation foundation model. arXiv preprint arXiv:2509.12129, 2025. 1

  44. [44]

    MapNav: A Novel Memory Representation via Annotated Semantic Maps for Vision-and-Language Navigation

    L Zhang, X Hao, Q Xu, Q Zhang, X Zhang, P Wang, J Zhang, Z Wang, S Zhang, and R MapNav Xu. A novel memory representation via annotated semantic maps for vlm-based vision-and-language navigation.arXiv preprint arXiv:2502.13451, 2025. 6

  45. [45]

    Hierarchical object-to-zone graph for object navigation

    Sixian Zhang, Xinhang Song, Yubing Bai, Weijie Li, Yakui Chu, and Shuqiang Jiang. Hierarchical object-to-zone graph for object navigation. InProceedings of the IEEE/CVF in- ternational conference on computer vision, pages 15130– 15140, 2021

  46. [46]

    Generative meta-adversarial network for unseen object navigation

    Sixian Zhang, Weijie Li, Xinhang Song, Yubing Bai, and Shuqiang Jiang. Generative meta-adversarial network for unseen object navigation. InEuropean Conference on Com- puter Vision, pages 301–320. Springer, 2022

  47. [47]

    Imagine before go: Self-supervised generative map for object goal navigation

    Sixian Zhang, Xinyao Yu, Xinhang Song, Xiaohan Wang, and Shuqiang Jiang. Imagine before go: Self-supervised generative map for object goal navigation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pat- tern Recognition, pages 16414–16425, 2024

  48. [48]

    LLaVA-Video: Video Instruction Tuning With Synthetic Data

    Yuanhan Zhang, Jinming Wu, Wei Li, Bo Li, Zejun Ma, Zi- wei Liu, and Chunyuan Li. Video instruction tuning with synthetic data.arXiv preprint arXiv:2410.02713, 2024. 1, 5

  49. [49]

    Towards learning a generalist model for embod- ied navigation

    Duo Zheng, Shijia Huang, Lin Zhao, Yiwu Zhong, and Li- wei Wang. Towards learning a generalist model for embod- ied navigation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13624– 13634, 2024. 3

  50. [50]

    Learning from videos for 3d world: Enhancing mllms with 3d vision geometry priors.arXiv preprint arXiv:2505.24625, 2025

    Duo Zheng, Shijia Huang, Yanyang Li, and Liwei Wang. Learning from videos for 3d world: Enhancing mllms with 3d vision geometry priors.arXiv preprint arXiv:2505.24625,

  51. [51]

    Navgpt-2: Unleashing navigational reasoning capa- bility for large vision-language models

    Gengze Zhou, Yicong Hong, Zun Wang, Xin Eric Wang, and Qi Wu. Navgpt-2: Unleashing navigational reasoning capa- bility for large vision-language models. InEuropean Con- ference on Computer Vision, pages 260–278. Springer, 2024. 3 GA-VLN: Geometry-Aware BEV Representation for Efficient Vision-Language Navigation Supplementary Material A. Real-World Robot E...