GA-VLN: Geometry-Aware BEV Representation for Efficient Vision-Language Navigation
Pith reviewed 2026-05-22 06:41 UTC · model grok-4.3
The pith
Projecting RGB-D features into agent-centric BEV maps and enriching them with pretrained 3D priors lets vision-language navigation models reach state-of-the-art performance using only navigation data.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By projecting visual features from RGB-D inputs into 3D space and aggregating them into an agent-centric BEV layout, then enriching that layout with features from a pretrained 3D foundation model, the GA-BEV representation supplies both explicit depth geometry and implicit structural priors inside a compact token set that multimodal language models can use directly for navigation.
What carries the argument
Geometry-Aware BEV (GA-BEV) representation: the explicit 3D projection of RGB-D patches followed by agent-centric aggregation, augmented by implicit priors from a 3D foundation model.
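To make the explicit half of this representation concrete, here is a minimal sketch of depth-based projection followed by agent-centric aggregation. It is an illustration under stated assumptions, not the paper's implementation: the function name `bev_from_rgbd`, the pinhole intrinsics `fx` and `cx`, the patch size, the grid size, the cell size, and the choice of mean pooling are all hypothetical.

```python
import math

def bev_from_rgbd(depth, feats, fx, cx, patch=16, grid=32, cell=0.25):
    """Mean-pool patch features into an agent-centric BEV grid.

    depth: H x W nested list of metric depths (m), camera looking along +z.
    feats: (H // patch) x (W // patch) nested list of feature vectors.
    Returns (bev, count): bev[r][c] is the mean feature vector of cell (r, c),
    count[r][c] is the number of patches that landed in that cell.
    All parameter names and defaults are assumptions made for this sketch.
    """
    dim = len(feats[0][0])
    bev = [[[0.0] * dim for _ in range(grid)] for _ in range(grid)]
    count = [[0] * grid for _ in range(grid)]
    for i, feat_row in enumerate(feats):
        for j, f in enumerate(feat_row):
            # Pixel coordinates of the patch centre.
            v = int((i + 0.5) * patch)
            u = int((j + 0.5) * patch)
            z = depth[v][u]                # forward distance from the camera
            if z <= 0:                     # missing or invalid depth reading
                continue
            x = (u - cx) * z / fx          # lateral offset (pinhole model)
            # Height is discarded when flattening onto the ground plane.
            r = math.floor(z / cell)               # rows grow away from the agent
            c = math.floor(x / cell) + grid // 2   # agent sits at the centre column
            if 0 <= r < grid and 0 <= c < grid:
                count[r][c] += 1
                for k in range(dim):
                    bev[r][c][k] += f[k]
    # Turn per-cell sums into means; empty cells stay at zero.
    for r in range(grid):
        for c in range(grid):
            if count[r][c]:
                bev[r][c] = [s / count[r][c] for s in bev[r][c]]
    return bev, count
```

With a 32 x 32 grid the language model would see at most 1,024 map cells regardless of how many patches the camera frames produce, which is the kind of compression the core claim relies on. The implicit priors from the 3D foundation model are not modeled in this sketch.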
If this is right
- Navigation agents can process shorter token sequences and therefore run with lower compute and memory cost per step.
- Explicit depth projection combined with 3D priors improves spatial reasoning accuracy without requiring mixed VQA training.
- Models trained only on navigation trajectories reach state-of-the-art success rates, showing that extra augmentation data is not necessary.
- The same BEV construction can be swapped into other MLLM-based embodied tasks that need compact spatial context.
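The first consequence above is a matter of arithmetic, which the following sketch spells out. The frame count, patches per frame, and BEV grid size are made-up illustrative values, not figures from the paper, and the quadratic cost model only covers the self-attention term of a transformer.

```python
def dense_tokens(frames: int, patches_per_frame: int) -> int:
    """Visual tokens when every frame contributes all of its RGB patches."""
    return frames * patches_per_frame

def bev_tokens(grid_size: int) -> int:
    """Visual tokens when the history is summarised as one square BEV map.

    Treating the map as a single grid_size x grid_size token grid is an
    assumption of this sketch.
    """
    return grid_size * grid_size

def attention_cost_ratio(n_dense: int, n_compact: int) -> float:
    """Ratio of self-attention cost, which grows with sequence length squared."""
    return (n_dense / n_compact) ** 2

# Hypothetical setting: 8 frames of 196 patches versus a 16 x 16 BEV map.
dense = dense_tokens(8, 196)
compact = bev_tokens(16)
ratio = attention_cost_ratio(dense, compact)
```

Under these hypothetical numbers the dense sequence has 1,568 visual tokens and the BEV map has 256, so the attention term shrinks by a factor of roughly 37.5. The real saving depends on the paper's actual configuration.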
Where Pith is reading between the lines
- The token reduction may allow real-time inference on robots with limited onboard compute.
- The method could be tested on longer-horizon tasks where spatial consistency matters more than single-step accuracy.
- Replacing the 3D foundation model with a smaller or domain-specific encoder might further improve efficiency while keeping most gains.
Load-bearing premise
Projecting RGB-D features into 3D and flattening them into an agent-centric BEV map keeps enough geometric detail for correct navigation decisions.
What would settle it
Run the same navigation model on a standard VLN benchmark once with the proposed BEV maps and once with the original dense RGB patches, using identical training data and no DAgger; success rate and token count would decide whether the geometric compression preserves or loses necessary spatial information.
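A comparison of this kind comes down to a handful of per-episode statistics. The sketch below computes success rate, SPL (success weighted by path length), navigation error, and mean token count for one run. The `Episode` fields and the 3 m success radius are assumptions for illustration; the radius follows common VLN practice and should be replaced by whatever the chosen benchmark specifies.

```python
from dataclasses import dataclass

@dataclass
class Episode:
    final_dist: float    # distance from the stop position to the goal (m)
    path_len: float      # length of the path the agent actually took (m)
    shortest_len: float  # shortest-path length from start to goal (m)
    tokens: int          # visual tokens fed to the model per step

def summarize(episodes, success_radius=3.0):
    """Aggregate navigation metrics over a non-empty list of episodes.

    Assumes shortest_len > 0 for every episode, so the SPL ratio is defined.
    """
    n = len(episodes)
    success = [e.final_dist <= success_radius for e in episodes]
    # SPL: each success is discounted by how much longer the taken path was.
    spl = sum(
        s * e.shortest_len / max(e.path_len, e.shortest_len)
        for s, e in zip(success, episodes)
    ) / n
    return {
        "SR": sum(success) / n,
        "SPL": spl,
        "NE": sum(e.final_dist for e in episodes) / n,
        "tokens": sum(e.tokens for e in episodes) / n,
    }
```

Calling `summarize` once on the BEV run and once on the dense-RGB run, with the same episodes and training data, yields the two rows the proposed test needs.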
Original abstract
Despite significant progress in Vision-Language Navigation (VLN), existing approaches still rely on dense RGB videos that produce excessive patch tokens and lack explicit spatial structure, resulting in substantial computational overhead and limited spatial reasoning. To address these issues, we introduce the Geometry-Aware BEV (GA-BEV), a compact, 3D-grounded feature representation that integrates both explicit and implicit geometric cues into multimodal large language model (MLLM)-based navigation systems. We construct BEV spatial maps from RGB-D inputs by projecting visual features into 3D space and aggregating them into an agent-centric layout that preserves geometric consistency while reducing token redundancy. To further enrich geometric understanding, we incorporate features from a pretrained 3D foundation model into the BEV space, injecting structural priors learned from large-scale 3D reconstruction tasks. Together, these complementary cues (explicit depth-based projection and implicit learned priors) yield compact yet spatially expressive representations that substantially improve navigation efficiency and performance. Experiments show that our method achieves state-of-the-art results using only navigation data, without DAgger augmentation or mixed VQA training, demonstrating the robustness and data efficiency of the proposed GA-VLN framework.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes GA-VLN, a framework for Vision-Language Navigation that introduces a Geometry-Aware BEV (GA-BEV) representation. This is constructed by projecting visual features from RGB-D inputs into 3D space and aggregating them into an agent-centric layout to preserve geometric consistency while reducing token redundancy; pretrained 3D foundation model features are then injected to supply structural priors. The approach is integrated into MLLM-based navigation and claims state-of-the-art performance using only navigation data, without DAgger augmentation or mixed VQA training.
Significance. If the reported results hold, the work offers a concrete route to more efficient VLN by replacing dense RGB token streams with a compact, explicitly geometric BEV map augmented by 3D priors. The demonstrated data efficiency—achieving strong performance from navigation data alone—is a clear strength that could reduce reliance on expensive augmentation pipelines. The explicit separation of depth-based projection and learned 3D priors also provides a useful ablation axis for future geometric VLN research.
minor comments (3)
- §4 (Experiments): the main results table should report absolute success rate, SPL, and navigation error for all baselines and ablations in a single, easily comparable format rather than scattering key numbers across text and supplementary material.
- §3.2 (BEV construction): the aggregation step that maps projected 3D points to the agent-centric grid is described only at a high level; adding the explicit binning or interpolation formula would improve reproducibility.
- Figure 3 caption: the visualization of GA-BEV features would be clearer if it explicitly labeled which channels correspond to the depth-projected RGB features versus the injected 3D-foundation-model features.
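For the second comment, one form the requested binning formula could take is sketched below. This is an assumed nearest-cell, mean-pooling variant written for illustration, not the paper's formula. The symbols are hypothetical: $d_i$ is the depth at patch centre $(u_i, v_i)$, $K$ the camera intrinsics, $s$ the cell size, $G$ the (even) grid width, and $\mathbf{f}_i$ the visual feature of patch $i$.

```latex
\begin{aligned}
\mathbf{p}_i &= d_i \, K^{-1} \, [u_i,\; v_i,\; 1]^\top = (x_i,\; y_i,\; z_i)^\top, \\
(r_i,\; c_i) &= \left( \left\lfloor \frac{z_i}{s} \right\rfloor,\;
                     \left\lfloor \frac{x_i}{s} \right\rfloor + \frac{G}{2} \right), \\
\mathbf{B}_{r,c} &= \frac{1}{|\mathcal{S}_{r,c}|} \sum_{i \in \mathcal{S}_{r,c}} \mathbf{f}_i,
\qquad \mathcal{S}_{r,c} = \{\, i : (r_i, c_i) = (r, c) \,\}.
\end{aligned}
```

Cells with an empty $\mathcal{S}_{r,c}$ would need a stated convention, for example $\mathbf{B}_{r,c} = \mathbf{0}$. A bilinear-interpolation variant would instead spread each $\mathbf{f}_i$ over the four nearest cells; which of these the authors use is exactly what the comment asks them to state.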
Simulated Author's Rebuttal
We thank the referee for the positive review and recommendation for minor revision. We appreciate the recognition of the data efficiency and geometric contributions of GA-VLN.
Circularity Check
No significant circularity; derivation is self-contained
The paper constructs GA-BEV via explicit depth-based projection of RGB-D features into 3D space followed by agent-centric aggregation, plus injection of features from an external pretrained 3D foundation model. These steps rely on standard geometric projection pipelines and off-the-shelf priors rather than any self-referential fitting, parameter estimation from the target navigation outputs, or load-bearing self-citations. Ablations isolate each cue's contribution and demonstrate gains on navigation data alone without DAgger or VQA mixing. No equation reduces a claimed prediction to its own inputs by construction, and the central efficiency and SOTA claims rest on independent experimental controls.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: RGB-D inputs supply reliable depth values that can be projected into consistent 3D space without significant sensor noise or calibration error.
- domain assumption: Features from a pretrained 3D foundation model transfer useful structural priors to the navigation BEV space without domain-specific fine-tuning.
invented entities (1)
- GA-BEV representation: no independent evidence
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/AlexanderDuality.lean · alexander_duality_circle_linking · tag: unclear
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: "We construct BEV spatial maps from RGB-D inputs by projecting visual features into 3D space and aggregating them into an agent-centric layout that preserves geometric consistency while reducing token redundancy."
- IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · tag: unclear
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: "Experiments show that our method achieves state-of-the-art results using only navigation data, without DAgger augmentation or mixed VQA training."
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.