arXiv: 2606.00095 · v1 · pith:VGLOYHWFnew · submitted 2026-05-25 · 💻 cs.CV · cs.AI · cs.CL · cs.RO

Bridging the 2D-3D Gap: A Hierarchical Semantic-Geometric Map for Vision Language Navigation

Pith reviewed 2026-06-29 22:30 UTC · model grok-4.3

classification 💻 cs.CV · cs.AI · cs.CL · cs.RO
keywords Vision-Language Navigation · Hierarchical Semantic-Geometric Map · Zero-shot Navigation · Embodied AI · 3D Spatial Reasoning · VLM Waypoint Planning · Subtask Decomposition

The pith

A three-level top-down map lets VLMs select waypoints for zero-shot 3D navigation while classical planning handles movement.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces the Hierarchical Semantic-Geometric Map to close the gap between VLMs' strength in 2D and language and their weakness in 3D spatial reasoning for embodied navigation. The map converts 3D geometry into a structured top-down format with separate channels for navigable space, object relations, and high-level decisions. The VLM plans only at the waypoint level, a classical algorithm computes the collision-free paths between waypoints, and long instructions are broken into subtasks. With this division of labor, the approach reports state-of-the-art zero-shot results on the R2R-CE and RxR-CE benchmarks.

Core claim

HSGM is a multi-channel top-down map with a geometric level for regions and obstacles, a semantic level for objects and relations, and a decision level for task reasoning. The VLM interprets this map to choose geometrically valid waypoints as a high-level planner, while low-level path planning executes collision-free moves between waypoints. Complex instructions are decomposed into subtasks to prevent progress forgetting or hallucination during long-horizon navigation.

What carries the argument

The Hierarchical Semantic-Geometric Map (HSGM), a multi-channel top-down representation organized into geometric, semantic, and decision levels that encodes 3D spatial layout for VLM-based waypoint selection.
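
To make the three-level structure concrete, here is a minimal sketch of how such a map could be held in memory. It is an illustration only, not the authors' implementation: the class name `HSGMSketch`, the cell codes, and every field and method name are assumptions introduced here, and the real HSGM is built from RGB-D and pose data that this toy ignores.

```python
from dataclasses import dataclass, field

# Hypothetical cell codes for the geometric level (not from the paper).
FREE, OBSTACLE, UNKNOWN = 0, 1, 2


@dataclass
class HSGMSketch:
    """Toy three-level top-down map; all names are illustrative assumptions."""
    height: int
    width: int
    geometric: list = field(init=False)            # per-cell FREE / OBSTACLE / UNKNOWN
    semantic: dict = field(default_factory=dict)   # object label -> (row, col)
    decision: dict = field(default_factory=dict)   # waypoint id -> (row, col)

    def __post_init__(self):
        # Every cell starts as UNKNOWN until an observation marks it.
        self.geometric = [[UNKNOWN] * self.width for _ in range(self.height)]

    def mark(self, row, col, code):
        """Record a geometric observation for one cell."""
        self.geometric[row][col] = code

    def add_object(self, label, row, col):
        """Record an object position on the semantic level."""
        self.semantic[label] = (row, col)

    def add_waypoint(self, wp_id, row, col):
        """Register a candidate waypoint only if it lies on free space."""
        if self.geometric[row][col] != FREE:
            return False
        self.decision[wp_id] = (row, col)
        return True


# Toy usage with made-up coordinates.
demo = HSGMSketch(height=3, width=4)
demo.mark(1, 1, FREE)
demo.mark(1, 2, OBSTACLE)
demo.add_object("couch", 1, 2)
accepted = demo.add_waypoint(0, 1, 1)   # lands on a free cell
rejected = demo.add_waypoint(1, 1, 2)   # lands on an obstacle cell
```

Restricting the decision level to cells the geometric level marks as free is one simple way to ensure that any waypoint offered to the VLM is geometrically valid by construction.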

If this is right

  • The VLM acts only as a semantic planner selecting waypoints from the map layout.
  • Subtask decomposition limits hallucination and forgetting in long-horizon navigation.
  • Low-level movement is fully decoupled and handled by classical collision-free planning.
  • The zero-shot framework reaches state-of-the-art on R2R-CE and RxR-CE, exceeding some supervised methods.
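
The decoupling in the third bullet can be illustrated with a small grid planner. The paper's Figure 2 names an A* planner for the low-level step; the sketch below is a generic textbook A* on a 4-connected occupancy grid, written under the assumption that `0` means free and `1` means obstacle. The function name and grid encoding are illustrative and not taken from the released code.

```python
import heapq


def astar(grid, start, goal):
    """Shortest 4-connected path on a grid (0 = free, 1 = obstacle).

    Returns the list of cells from start to goal, or None if unreachable.
    """
    rows, cols = len(grid), len(grid[0])

    def heuristic(cell):
        # Manhattan distance: admissible for unit-cost 4-connected moves.
        return abs(cell[0] - goal[0]) + abs(cell[1] - goal[1])

    open_heap = [(heuristic(start), 0, start)]
    came_from = {}
    best_cost = {start: 0}
    while open_heap:
        _, cost, current = heapq.heappop(open_heap)
        if cost > best_cost.get(current, float("inf")):
            continue  # stale heap entry, a cheaper route was already found
        if current == goal:
            path = [current]
            while current in came_from:
                current = came_from[current]
                path.append(current)
            return path[::-1]
        for d_row, d_col in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nxt = (current[0] + d_row, current[1] + d_col)
            if not (0 <= nxt[0] < rows and 0 <= nxt[1] < cols):
                continue
            if grid[nxt[0]][nxt[1]] == 1:
                continue
            new_cost = cost + 1
            if new_cost < best_cost.get(nxt, float("inf")):
                best_cost[nxt] = new_cost
                came_from[nxt] = current
                heapq.heappush(open_heap, (new_cost + heuristic(nxt), new_cost, nxt))
    return None


# Toy usage: a wall forces a detour around the right-hand side.
grid = [
    [0, 0, 0],
    [1, 1, 0],
    [0, 0, 0],
]
path = astar(grid, (0, 0), (2, 0))
blocked = astar([[0, 1], [1, 0]], (0, 0), (1, 1))
```

Because the planner only sees the occupancy grid and a target cell, the VLM never has to reason about collisions; it only has to name a destination.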

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same map structure could support other embodied tasks that require combining language instructions with physical layout.
  • If map construction from sensors proves robust, the method could reduce reliance on large-scale end-to-end navigation training.
  • Replacing the classical planner with a learned controller while keeping the high-level separation would test whether the core decoupling generalizes.

Load-bearing premise

The VLM can reliably read the spatial layout in the HSGM to pick valid waypoints without hallucinating or losing track of progress.

What would settle it

Provide an accurately constructed HSGM to the VLM on a benchmark task and observe whether it selects geometrically invalid waypoints at a rate higher than classical planners.
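
One way to operationalize this check is to count how often selected waypoints fall outside the map or on non-free cells. The sketch below assumes a grid encoding of `0` for free space and treats "geometrically invalid" as "out of bounds or not free"; both are assumptions made here for illustration, and a fuller test would also need to check reachability.

```python
def invalid_waypoint_rate(grid, waypoints):
    """Fraction of waypoints that are out of bounds or on non-free cells.

    grid: 2D list where 0 = free (assumed encoding).
    waypoints: list of (row, col) picks, e.g. logged VLM selections.
    """
    if not waypoints:
        return 0.0
    rows, cols = len(grid), len(grid[0])
    invalid = 0
    for row, col in waypoints:
        inside = 0 <= row < rows and 0 <= col < cols
        if not inside or grid[row][col] != 0:
            invalid += 1
    return invalid / len(waypoints)


# Toy usage with made-up picks: one on an obstacle, one off the map.
toy_grid = [[0, 1], [0, 0]]
toy_picks = [(0, 0), (0, 1), (5, 5), (1, 1)]
rate = invalid_waypoint_rate(toy_grid, toy_picks)
```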

Figures

Figures reproduced from arXiv: 2606.00095 by Jingyu Gong, Kailing Li, Liang He, Lijin Yang, Tianwen Qian, Xiaoling Wang, Yuqian Fu.

Figure 1
Figure 1: Our proposed Hierarchical Semantic-Geometric Map (HSGM). The 3D environment is modeled via three maps: Geometry, Semantic, and Decision. It is then rasterized into a 2D BEV Map and projected as visual prompts onto the agent’s view, serving as the structured visual input for the VLM.
Figure 2
Figure 2: Framework Overview. (1) An LLM decomposes the user instruction into a sequence of subtasks. (2) The agent’s sensor data (RGB-D, pose) is used to dynamically construct the 3D Hierarchical Semantic-Geometric Map. (3) The HSGM is rasterized into a 2D BEV map and projected onto the front view of the agent as visual input for the VLM. (4) The VLM performs CoT reasoning to select a waypoint, and the A* planner c…
Figure 3
Figure 3: Impact of Subtask Decomposition on Success Rate by Instruction Token Count
Figure 4
Figure 4 illustrates a typical case study of our framework. First, the LLM decomposes the complex instruction into three sequential, verifiable subtasks. The agent then executes them in order. At each decision point, the VLM acts as a high-level semantic planner. It observes the HSGM-rasterized BEV map and the first-person view (which includes waypoint visual prompts) to select a high-level action. For instance,…
Figure 5
Figure 5: Visualization of Success Cases. We showcase three episodes demonstrating the agent’s capability in multi-room traversal, object-referenced navigation (e.g., “passing the gray couch”), and precise destination identification (e.g., “wait on the white carpet”).
Figure 6
Figure 6: Failure Case 1: Sequential Counting Error. Instruction: “Walk down the hall and make a left at the end and stop at the first room on the right. Wait in the entryway of the room with the two couches.” At step 2, the agent should have continued moving along the hallway until reaching the end (Waypoint 3) before turning left. However, due to an inaccurate understanding of its own position, the VLM incor…
Figure 7
Figure 7: Failure Case 2: Premature Execution.
Original abstract

Vision-Language Navigation (VLN) enables embodied agents to reach target locations in unseen environments by following language instructions. Despite recent progress with vision-language models (VLMs), a critical semantic-geometric gap remains: while VLMs excel at language and 2D visual understanding, they struggle with 3D spatial reasoning and fail to capture the causal dynamics between actions and spatial transitions, resulting in unreliable navigation, particularly in zero-shot settings. To bridge this gap, we propose a Hierarchical Semantic-Geometric Map (HSGM) that transforms 3D geometric information into a structured representation compatible with VLMs, effectively linking them to the physical world. Specifically, HSGM is represented as a multi-channel top-down map organized into three levels: (1) geometric level that records navigable regions and obstacles, (2) semantic level that represents objects and their relations, and (3) decision level that supports high-level task reasoning and goal selection. During navigation, the VLM acts as a high-level semantic planner, interpreting the spatial layout encoded in the HSGM to select geometrically valid waypoints, while low-level, collision-free movements between waypoints are executed by a classical path-planning algorithm, fully decoupling semantic reasoning from action execution. Additionally, complex instructions are decomposed into subtasks to alleviate the problem of progress forgetting or hallucinating in long-horizon navigation. Extensive experiments on R2R-CE and RxR-CE benchmarks demonstrate that our zero-shot framework achieves state-of-the-art performance and even outperforms several supervised methods. Code is available at https://github.com/Teacher-Tom/HSGM_public.
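
The control flow the abstract describes (decompose, build the map, let the VLM pick a waypoint, let a classical planner move) can be sketched as a short loop. This is a schematic under stated assumptions, not the released code: `navigate` and every callable it receives are hypothetical stand-ins, and the toy usage replaces the LLM, VLM, mapper, and planner with trivial functions on a one-dimensional corridor.

```python
def navigate(instruction, decompose, build_map, select_waypoint, plan_path,
             is_done, max_steps=20):
    """Run subtasks in order; every callable is a hypothetical stand-in.

    decompose:       instruction -> list of subtasks (the LLM's role)
    build_map:       () -> current map (the HSGM's role)
    select_waypoint: (subtask, map) -> waypoint (the VLM's role)
    plan_path:       (map, waypoint) -> list of visited cells (the planner's role)
    is_done:         (subtask, map) -> bool
    """
    trajectory = []
    for subtask in decompose(instruction):
        for _ in range(max_steps):
            current_map = build_map()
            if is_done(subtask, current_map):
                break
            waypoint = select_waypoint(subtask, current_map)
            trajectory.extend(plan_path(current_map, waypoint))
    return trajectory


# Toy usage: positions are integers on a line, all values made up.
state = {"pos": 0}
goals = {"go to 2": 2, "go to 5": 5}


def toy_decompose(text):
    return [part.strip() for part in text.split(" then ")]


def toy_build_map():
    return {"pos": state["pos"]}


def toy_is_done(subtask, current_map):
    return current_map["pos"] == goals[subtask]


def toy_select_waypoint(subtask, current_map):
    return goals[subtask]


def toy_plan_path(current_map, waypoint):
    step = 1 if waypoint > current_map["pos"] else -1
    cells = list(range(current_map["pos"] + step, waypoint + step, step))
    state["pos"] = waypoint  # the agent ends up at the waypoint
    return cells


trajectory = navigate("go to 2 then go to 5", toy_decompose, toy_build_map,
                      toy_select_waypoint, toy_plan_path, toy_is_done)
```

Keeping only the current subtask in scope at each step is the mechanism the abstract credits with reducing progress forgetting on long instructions.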

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper claims to bridge the semantic-geometric gap in zero-shot Vision-Language Navigation by introducing the Hierarchical Semantic-Geometric Map (HSGM), a multi-channel top-down map with geometric, semantic, and decision levels. The VLM uses this map for high-level waypoint selection, decoupled from low-level classical planning, with subtask decomposition for long-horizon tasks. Extensive experiments on R2R-CE and RxR-CE benchmarks are reported to achieve state-of-the-art zero-shot performance, outperforming several supervised methods.

Significance. Should the zero-shot SOTA results hold under scrutiny, this approach would represent a meaningful step toward reliable VLN by structuring 3D information for VLMs without requiring task-specific training. The public release of code at the provided GitHub link supports reproducibility and is a notable strength.

major comments (3)
  1. §3: The precise mechanism by which the three-level HSGM is encoded and presented to the VLM (e.g., as image channels or textual description) is not specified in sufficient detail. This is load-bearing for the central claim, as the VLM's ability to select geometrically valid waypoints without hallucination depends on this interface.
  2. §4: The experimental results on R2R-CE and RxR-CE are presented without error bars, standard deviations across runs, or explicit confirmation of the data splits used. This undermines verification of the claim that the method outperforms several supervised approaches.
  3. §3.3: While subtask decomposition is proposed to address progress forgetting, no quantitative ablation study is provided to demonstrate its necessity or impact on the reported benchmark scores.
minor comments (2)
  1. Abstract: The zero-shot claim could benefit from explicit comparison to the number of parameters or training data in competing supervised methods.
  2. Figure 1: The caption for the HSGM visualization could more clearly label the three levels and their channels for reader clarity.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback and positive assessment of the work's significance. We address each major comment below and commit to incorporating clarifications and additional experiments in the revised manuscript.

Point-by-point responses
  1. Referee: §3: The precise mechanism by which the three-level HSGM is encoded and presented to the VLM (e.g., as image channels or textual description) is not specified in sufficient detail. This is load-bearing for the central claim, as the VLM's ability to select geometrically valid waypoints without hallucination depends on this interface.

    Authors: We agree that further detail on the interface is warranted. HSGM is explicitly constructed as a multi-channel top-down map (geometric, semantic, and decision channels) and is rendered directly as a multi-channel image input to the VLM for visual reasoning. In the revision we will expand §3 with a precise description of the channel encoding, an illustrative figure of the input format, and confirmation that no textual conversion is used, thereby grounding waypoint selection in the explicit spatial data. revision: yes

  2. Referee: §4: The experimental results on R2R-CE and RxR-CE are presented without error bars, standard deviations across runs, or explicit confirmation of the data splits used. This undermines verification of the claim that the method outperforms several supervised approaches.

    Authors: This observation is correct and we will strengthen the experimental reporting. The revised manuscript will include standard deviations and error bars computed over multiple runs, together with explicit confirmation that the standard R2R-CE and RxR-CE data splits were employed. revision: yes

  3. Referee: §3.3: While subtask decomposition is proposed to address progress forgetting, no quantitative ablation study is provided to demonstrate its necessity or impact on the reported benchmark scores.

    Authors: We will add a quantitative ablation study in the revised version that directly measures the effect of subtask decomposition on navigation success rate and SPL on both R2R-CE and RxR-CE, thereby demonstrating its contribution to mitigating progress forgetting. revision: yes
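
For readers unfamiliar with the metrics named in these responses, the sketch below computes success rate (SR) and SPL (success weighted by path length, where each successful episode contributes shortest-path length divided by the larger of the taken and shortest lengths), plus a mean and sample standard deviation across runs. The episode dictionaries, their keys, and all numbers are invented for illustration and are not results from the paper.

```python
from statistics import mean, stdev


def success_rate(episodes):
    """Fraction of episodes that ended in success."""
    return sum(1 for e in episodes if e["success"]) / len(episodes)


def spl(episodes):
    """Success weighted by Path Length, averaged over all episodes."""
    total = 0.0
    for e in episodes:
        if e["success"]:
            total += e["shortest"] / max(e["taken"], e["shortest"])
    return total / len(episodes)


def summarize(values):
    """Mean and sample standard deviation across independent runs."""
    return mean(values), stdev(values)


# Made-up episodes: lengths in meters, purely illustrative.
toy_episodes = [
    {"success": True, "shortest": 10.0, "taken": 10.0},
    {"success": True, "shortest": 5.0, "taken": 10.0},
    {"success": False, "shortest": 8.0, "taken": 3.0},
    {"success": False, "shortest": 6.0, "taken": 12.0},
]
toy_sr = success_rate(toy_episodes)
toy_spl = spl(toy_episodes)
run_mean, run_std = summarize([0.5, 0.25, 0.75])  # made-up per-run SR values
```

Reporting the pair returned by `summarize` for each metric is the kind of error bar the referee's second comment asks for.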

Circularity Check

0 steps flagged

No significant circularity in HSGM architectural proposal

full rationale

The paper proposes an HSGM framework consisting of a three-level top-down map (geometric, semantic, decision) that decouples VLM-based high-level waypoint selection from classical low-level path planning, with subtask decomposition for long-horizon tasks. No equations, fitted parameters, or predictions appear in the provided text that reduce by construction to inputs; performance claims rest on empirical benchmark results rather than derived quantities. No self-citations, uniqueness theorems, or ansatzes are invoked as load-bearing steps. The derivation chain is therefore self-contained as an engineering architecture whose validity is tested externally on R2R-CE and RxR-CE.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entity

The central claim rests on the domain assumption that a VLM can interpret a structured top-down map for waypoint selection and that subtask decomposition prevents progress forgetting; no free parameters or invented physical entities are mentioned.

axioms (2)
  • domain assumption: VLMs can interpret spatial layout in a multi-channel top-down map to select geometrically valid waypoints
    Invoked when stating that the VLM acts as high-level semantic planner
  • domain assumption: Decomposing complex instructions into subtasks alleviates progress forgetting or hallucinating
    Stated as an additional technique to support long-horizon navigation
invented entities (1)
  • Hierarchical Semantic-Geometric Map (HSGM) · no independent evidence
    purpose: Transforms 3D geometric information into structured representation compatible with VLMs
    New map representation introduced with three explicit levels

pith-pipeline@v0.9.1-grok · 5852 in / 1441 out tokens · 27185 ms · 2026-06-29T22:30:52.363342+00:00 · methodology


Reference graph

Works this paper leans on

46 extracted references · 11 canonical work pages · 3 internal anchors

  1. [1]

    Etpnav: Evolving topological planning for vision-language navigation in continuous environments

    Dong An, Hanqing Wang, Wenguan Wang, Zun Wang, Yan Huang, Keji He, and Liang Wang. Etpnav: Evolving topological planning for vision-language navigation in continuous environments. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2024.

  2. [2]

    Vision-and-language navigation: Interpreting visually-grounded navigation instructions in real environments

    Peter Anderson, Qi Wu, Damien Teney, Jake Bruce, Mark Johnson, Niko Sünderhauf, Ian Reid, Stephen Gould, and Anton Van Den Hengel. Vision-and-language navigation: Interpreting visually-grounded navigation instructions in real environments. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3674–3683,

  3. [3]

    Sim-to-real transfer for vision-and-language navigation

    Peter Anderson, Ayush Shrivastava, Joanne Truong, Arjun Majumdar, Devi Parikh, Dhruv Batra, and Stefan Lee. Sim-to-real transfer for vision-and-language navigation. In Conference on Robot Learning, pages 671–681. PMLR, 2021.

  4. [4]

    Matterport3d: Learning from rgb-d data in indoor environments

    Angel Chang, Angela Dai, Thomas Funkhouser, Maciej Halber, Matthias Nießner, Manolis Savva, Shuran Song, Andy Zeng, and Yinda Zhang. Matterport3d: Learning from rgb-d data in indoor environments. In 2017 International Conference on 3D Vision (3DV), pages 667–676. IEEE Computer Society, 2017.

  5. [5]

    Mapgpt: Map-guided prompting with adaptive path planning for vision-and-language navigation

    Jiaqi Chen, Bingqian Lin, Ran Xu, Zhenhua Chai, Xiaodan Liang, and Kwan-Yee Wong. Mapgpt: Map-guided prompting with adaptive path planning for vision-and-language navigation. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 9796–9810, 2024.

  6. [6]

    Affordances-oriented planning using foundation models for continuous vision-language navigation

    Jiaqi Chen, Bingqian Lin, Xinmin Liu, Lin Ma, Xiaodan Liang, and Kwan-Yee K Wong. Affordances-oriented planning using foundation models for continuous vision-language navigation. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 23568–23576, 2025.

  7. [7]

    Constraint-aware zero-shot vision-language navigation in continuous environments

    Kehan Chen, Dong An, Yan Huang, Rongtao Xu, Yifei Su, Yonggen Ling, Ian Reid, and Liang Wang. Constraint-aware zero-shot vision-language navigation in continuous environments. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2025.

  8. [8]

    a2 nav: Action-aware zero-shot robot navigation by exploiting vision-and-language ability of foundation models

    Peihao Chen, Xinyu Sun, Hongyan Zhi, Runhao Zeng, Thomas H Li, Gaowen Liu, Mingkui Tan, and Chuang Gan. a2 nav: Action-aware zero-shot robot navigation by exploiting vision-and-language ability of foundation models. arXiv preprint arXiv:2308.07997, 2023.

  9. [9]

    Think global, act local: Dual-scale graph transformer for vision-and-language navigation

    Shizhe Chen, Pierre-Louis Guhur, Makarand Tapaswi, Cordelia Schmid, and Ivan Laptev. Think global, act local: Dual-scale graph transformer for vision-and-language navigation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16537–16547, 2022.

  10. [10]

    Objectrelator: Enabling cross-view object relation understanding across ego-centric and exo-centric perspectives

    Yuqian Fu, Runze Wang, Bin Ren, Guolei Sun, Biao Gong, Yanwei Fu, Danda Pani Paudel, Xuanjing Huang, and Luc Van Gool. Objectrelator: Enabling cross-view object relation understanding across ego-centric and exo-centric perspectives. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 2025.

  11. [11]

    Bridging the gap between learning in discrete and continuous environments for vision-and-language navigation

    Yicong Hong, Zun Wang, Qi Wu, and Stephen Gould. Bridging the gap between learning in discrete and continuous environments for vision-and-language navigation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 15439–15449, 2022.

  12. [12]

    Learning navigational visual representations with semantic map supervision

    Yicong Hong, Yang Zhou, Ruiyi Zhang, Franck Dernoncourt, Trung Bui, Stephen Gould, and Hao Tan. Learning navigational visual representations with semantic map supervision. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 3055–3067, 2023.

  13. [13]

    Semantically-aware spatio-temporal reasoning agent for vision-and-language navigation in continuous environments

    Muhammad Zubair Irshad, Niluthpol Chowdhury Mithun, Zachary Seymour, Han-Pang Chiu, Supun Samarasekera, and Rakesh Kumar. Semantically-aware spatio-temporal reasoning agent for vision-and-language navigation in continuous environments. In 2022 26th International conference on pattern recognition (ICPR), pages 4065–4071. IEEE, 2022.

  14. [14]

    Beyond the nav-graph: Vision-and-language navigation in continuous environments

    Jacob Krantz, Erik Wijmans, Arjun Majumdar, Dhruv Batra, and Stefan Lee. Beyond the nav-graph: Vision-and-language navigation in continuous environments. In European Conference on Computer Vision, pages 104–120. Springer, 2020.

  15. [15]

    Waypoint models for instruction-guided navigation in continuous environments

    Jacob Krantz, Aaron Gokaslan, Dhruv Batra, Stefan Lee, and Oleksandr Maksymets. Waypoint models for instruction-guided navigation in continuous environments. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 15162–15171, 2021.

  16. [16]

    Room-across-room: Multilingual vision-and-language navigation with dense spatiotemporal grounding

    Alexander Ku, Peter Anderson, Roma Patel, Eugene Ie, and Jason Baldridge. Room-across-room: Multilingual vision-and-language navigation with dense spatiotemporal grounding. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 4392–4412, 2020.

  17. [17]

    Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models

    Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In International conference on machine learning, pages 19730–19742. PMLR, 2023.

  18. [18]

    Grounding dino: Marrying dino with grounded pre-training for open-set object detection

    Shilong Liu, Zhaoyang Zeng, Tianhe Ren, Feng Li, Hao Zhang, Jie Yang, Qing Jiang, Chunyuan Li, Jianwei Yang, Hang Su, et al. Grounding dino: Marrying dino with grounded pre-training for open-set object detection. In European conference on computer vision, pages 38–55. Springer,

  19. [19]

    Instructnav: Zero-shot system for generic instruction navigation in unexplored environment

    Yuxing Long, Wenzhe Cai, Hongcheng Wang, Guanqi Zhan, and Hao Dong. Instructnav: Zero-shot system for generic instruction navigation in unexplored environment. In Conference on Robot Learning, pages 2049–2060. PMLR, 2025.

  20. [20]

    Pivot: iterative visual prompting elicits actionable knowledge for vlms

    Soroush Nasiriany, Fei Xia, Wenhao Yu, Ted Xiao, Jacky Liang, Ishita Dasgupta, Annie Xie, Danny Driess, Ayzaan Wahid, Zhuo Xu, et al. Pivot: iterative visual prompting elicits actionable knowledge for vlms. In Proceedings of the 41st International Conference on Machine Learning, pages 37321–37341, 2024.

  21. [21]

    Gpt4scene: Understand 3d scenes from videos with vision-language models

    Zhangyang Qi, Zhixiong Zhang, Ye Fang, Jiaqi Wang, and Hengshuang Zhao. Gpt4scene: Understand 3d scenes from videos with vision-language models. arXiv preprint arXiv:2501.01428, 2025.

  22. [22]

    Vln-r1: Vision-language navigation via reinforcement fine-tuning

    Zhangyang Qi, Zhixiong Zhang, Yizhou Yu, Jiaqi Wang, and Hengshuang Zhao. Vln-r1: Vision-language navigation via reinforcement fine-tuning. arXiv preprint arXiv:2506.17221,

  23. [23]

    Open-nav: Exploring zero-shot vision-and-language navigation in continuous environment with open-source llms

    Yanyuan Qiao, Wenqi Lyu, Hui Wang, Zixu Wang, Zerui Li, Yuan Zhang, Mingkui Tan, and Qi Wu. Open-nav: Exploring zero-shot vision-and-language navigation in continuous environment with open-source llms. In 2025 IEEE International Conference on Robotics and Automation (ICRA), pages 6710–6717. IEEE, 2025.

  24. [24]

    Habitat: A platform for embodied ai research

    Manolis Savva, Abhishek Kadian, Oleksandr Maksymets, Yili Zhao, Erik Wijmans, Bhavana Jain, Julian Straub, Jia Liu, Vladlen Koltun, Jitendra Malik, et al. Habitat: A platform for embodied ai research. In Proceedings of the IEEE/CVF international conference on computer vision, pages 9339–9347, 2019.

  25. [25]

    Smartway: Enhanced waypoint prediction and backtracking for zero-shot vision-and-language navigation

    Xiangyu Shi, Zerui Li, Wenqi Lyu, Jiatong Xia, Feras Dayoub, Yanyuan Qiao, and Qi Wu. Smartway: Enhanced waypoint prediction and backtracking for zero-shot vision-and-language navigation. In 2025 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 16923–16930. IEEE, 2025.

  26. [26]

    OpenAI GPT-5 System Card

    Aaditya Singh, Adam Fry, Adam Perelman, Adam Tart, Adi Ganesh, Ahmed El-Kishky, Aidan McLaughlin, Aiden Low, AJ Ostrow, Akhila Ananthram, et al. Openai gpt-5 system card. arXiv preprint arXiv:2601.03267, 2025.

  27. [27]

    Towards long-horizon vision-language navigation: Platform, benchmark and method

    Xinshuai Song, Weixing Chen, Yang Liu, Weikai Chen, Guanbin Li, and Liang Lin. Towards long-horizon vision-language navigation: Platform, benchmark and method. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 12078–12088, 2025.

  28. [28]

    Yoloe: Real-time seeing anything

    Ao Wang, Lihao Liu, Hui Chen, Zijia Lin, Jungong Han, and Guiguang Ding. Yoloe: Real-time seeing anything. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 24591–24602, 2025.

  29. [29]

    Jiawei Wang, Teng Wang, Lele Xu, Zichen He, and Changyin Sun. Discovering intrinsic subgoals for vision-and-language navigation via hierarchical reinforcement learning. IEEE Transactions on Neural Networks and Learning Systems, 36(4):6516–6528, 2024.

  30. [30]

    Dreamnav: A trajectory-based imaginative framework for zero-shot vision-and-language navigation

    Yunheng Wang, Yuetong Fang, Taowen Wang, Yixiao Feng, Yawen Tan, Shuning Zhang, Peiran Liu, Yiding Ji, and Renjing Xu. Dreamnav: A trajectory-based imaginative framework for zero-shot vision-and-language navigation. arXiv preprint arXiv:2509.11197, 2025.

  31. [31]

    Scaling data generation in vision-and-language navigation

    Zun Wang, Jialu Li, Yicong Hong, Yi Wang, Qi Wu, Mohit Bansal, Stephen Gould, Hao Tan, and Yu Qiao. Scaling data generation in vision-and-language navigation. In Proceedings of the IEEE/CVF international conference on computer vision, pages 12009–12020, 2023.

  32. [32]

    Gridmm: Grid memory map for vision-and-language navigation

    Zihan Wang, Xiangyang Li, Jiahao Yang, Yeqi Liu, and Shuqiang Jiang. Gridmm: Grid memory map for vision-and-language navigation. In Proceedings of the IEEE/CVF International conference on computer vision, pages 15625–15636, 2023.

  33. [33]

    Dynam3d: Dynamic layered 3d tokens empower vlm for vision-and-language navigation

    Zihan Wang, Seungjun Lee, and Gim Hee Lee. Dynam3d: Dynamic layered 3d tokens empower vlm for vision-and-language navigation. arXiv preprint arXiv:2505.11383,

  34. [34]

    Vlfm: Vision-language frontier maps for zero-shot semantic navigation

    Naoki Yokoyama, Sehoon Ha, Dhruv Batra, Jiuguang Wang, and Bernadette Bucher. Vlfm: Vision-language frontier maps for zero-shot semantic navigation. In 2024 IEEE International Conference on Robotics and Automation (ICRA), pages 42–48. IEEE, 2024.

  35. [35]

    How far are vlms from visual spatial intelligence? a benchmark-driven perspective

    Songsong Yu, Yuxin Chen, Hao Ju, Lianjie Jia, Fuxi Zhang, Shaofei Huang, Yuhan Wu, Rundi Cui, Binghao Ran, Zaibin Zhang, et al. How far are vlms from visual spatial intelligence? a benchmark-driven perspective. arXiv preprint arXiv:2509.18905, 2025.

  36. [36]

    Mc-gpt: Empowering vision-and-language navigation with memory map and reasoning chains

    Zhaohuan Zhan, Lisha Yu, Sijie Yu, and Guang Tan. Mc-gpt: Empowering vision-and-language navigation with memory map and reasoning chains. arXiv preprint arXiv:2405.10620,

  37. [37]

    Vision-language models for vision tasks: A survey

    Jingyi Zhang, Jiaxing Huang, Sheng Jin, and Shijian Lu. Vision-language models for vision tasks: A survey. IEEE transactions on pattern analysis and machine intelligence,

  38. [38]

    Uni-NaVid: A Video-based Vision-Language-Action Model for Unifying Embodied Navigation Tasks

    Jiazhao Zhang, Kunyu Wang, Shaoan Wang, Minghan Li, Haoran Liu, Songlin Wei, Zhongyuan Wang, Zhizheng Zhang, and He Wang. Uni-navid: A video-based vision-language-action model for unifying embodied navigation tasks. arXiv preprint arXiv:2412.06224, 2024.

  39. [39]

    NaVid: Video-based VLM Plans the Next Step for Vision-and-Language Navigation

    Jiazhao Zhang, Kunyu Wang, Rongtao Xu, Gengze Zhou, Yicong Hong, Xiaomeng Fang, Qi Wu, Zhizheng Zhang, and He Wang. Navid: Video-based vlm plans the next step for vision-and-language navigation. arXiv preprint arXiv:2402.15852, 2024.

  40. [40]

    MapNav: A novel memory representation via annotated semantic maps for VLM-based vision-and-language navigation

    Lingfeng Zhang, Xiaoshuai Hao, Qinwen Xu, Qiang Zhang, Xinyao Zhang, Pengwei Wang, Jing Zhang, Zhongyuan Wang, Shanghang Zhang, and Renjing Xu. MapNav: A novel memory representation via annotated semantic maps for VLM-based vision-and-language navigation. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1...

  41. [41]

    Apexnav: An adaptive exploration strategy for zero-shot object navigation with target-centric semantic fusion

    Mingjie Zhang, Yuheng Du, Chengkai Wu, Jinni Zhou, Zhenchao Qi, Jun Ma, and Boyu Zhou. Apexnav: An adaptive exploration strategy for zero-shot object navigation with target-centric semantic fusion. IEEE Robotics and Automation Letters, 2025.

  42. [42]

    Multimodal spatial reasoning in the large model era: A survey and benchmarks

    Xu Zheng, Zihao Dongfang, Lutao Jiang, Boyuan Zheng, Yulong Guo, Zhenquan Zhang, Giuliano Albanese, Runyi Yang, Mengjiao Ma, Zixin Zhang, et al. Multimodal spatial reasoning in the large model era: A survey and benchmarks. arXiv preprint arXiv:2510.25760, 2025.

  43. [43]

    Navgpt: Explicit reasoning in vision-and-language navigation with large language models

    Gengze Zhou, Yicong Hong, and Qi Wu. Navgpt: Explicit reasoning in vision-and-language navigation with large language models. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 7641–7649, 2024.

  44. [44]

    More Implementation Details 6.1. Implementation of Multi-Floor Navigation To enable robust navigation in complex, multi-stair environments, we extend the HSGM framework to manage independent map representations for each floor. Let $F = \{f_0, \dots, f_n\}$ denote the set of floors. For each floor $f_i$, the system maintains a dedicated map instance $M^{(i)}$ containin...

  45. [45]

    Navigate to the target object [object] and get as close to it as possible

    Supplementary Results 7.1. Performance on Object Goal Navigation To further assess the versatility and generalization capability of our framework beyond instruction-following, we evaluated HSGM on the Object Goal Navigation (ObjectNav) task using the challenging Habitat-Matterport 3D (HM3D) [?] dataset. In this setting, the complex narrative instruc...

  46. [46]

    second door on the left

    More Visualization Results 8.1. Success Cases As illustrated in Figure 5, our framework demonstrates robust performance across diverse and complex indoor environments. The visualization underscores the system’s ability to decompose complex natural language instructions into manageable sequential subtasks, providing a clear and structured roadmap for l...