
arxiv: 2605.17249 · v1 · pith:H7G66W7Pnew · submitted 2026-05-17 · 💻 cs.RO

SEDualVLN: A Spatially-Enhanced Dual-System for Vision-Language Navigation

Pith reviewed 2026-05-20 13:27 UTC · model grok-4.3

classification 💻 cs.RO
keywords vision-language navigation · dual-system · spatial awareness · 3D mapping · multimodal large language model · waypoint planning · unseen environments · VLN-CE

The pith

A dual-system VLN framework pairs a fast spatially-aware vision-language model for actions with a slow MLLM planner using real-time 3D maps to reach state-of-the-art results on unseen environments.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tries to establish that splitting navigation work between two cooperating systems can overcome the limits of both pure end-to-end models and pure zero-shot planners. System 1 adds global and local spatial awareness to a trained vision-language model so it can generate actions quickly while keeping a sense of direction. System 2 builds a live 3D map and lets a multimodal language model plan waypoints by looking at top-down views and rendered paths. The two systems then coordinate in a fast-slow loop to finish long language-guided trips. A reader would care if this mix really reduces getting lost on extended routes and cuts down on slow reasoning time in places the agent has never seen before.

Core claim

The central claim is that a spatially-enhanced dual-system VLN framework achieves state-of-the-art performance on VLN-CE benchmarks. System 1, a vision-language model augmented with global and local spatial awareness, generates actions rapidly. System 2 integrates a multimodal large language model with a mapping module and plans waypoints from top-down 3D map views and streams of rendered path images. The two systems cooperate through a fast-slow coordinated approach to complete navigation tasks.

What carries the argument

The spatially-enhanced dual-system in which System 1 supplies quick actions from a vision-language model with added spatial awareness and System 2 supplies waypoint plans from an MLLM operating on top-down 3D maps and path images.
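
To make the "top-down 3D map view" concrete, here is a minimal sketch of one common way to flatten a 3D point set into a 2D occupancy grid. The text reproduced on this page does not say how SEDualVLN renders its map, so everything here is an assumption: the function name `top_down_grid`, the y-up axis convention, the cell size, and the height band used to pick out obstacles.

```python
from typing import Iterable, List, Tuple

Point3D = Tuple[float, float, float]  # (x, y, z) in metres; y is "up" (assumed)


def top_down_grid(
    points: Iterable[Point3D],
    cell_size: float = 0.5,
    min_height: float = 0.1,
    max_height: float = 1.5,
) -> Tuple[List[List[int]], Tuple[float, float]]:
    """Project 3D points onto an x-z occupancy grid (hypothetical helper).

    Points whose height lies outside [min_height, max_height] are ignored,
    so floor and ceiling points do not mark cells as occupied.
    Returns (grid, origin): grid[row][col] is 1 if occupied, and origin is
    the (x, z) world coordinate of cell (0, 0).
    """
    obstacles = [(x, z) for x, y, z in points if min_height <= y <= max_height]
    if not obstacles:
        return [], (0.0, 0.0)

    min_x = min(x for x, _ in obstacles)
    min_z = min(z for _, z in obstacles)
    max_x = max(x for x, _ in obstacles)
    max_z = max(z for _, z in obstacles)

    cols = int((max_x - min_x) / cell_size) + 1
    rows = int((max_z - min_z) / cell_size) + 1
    grid = [[0] * cols for _ in range(rows)]

    for x, z in obstacles:
        row = int((z - min_z) / cell_size)
        col = int((x - min_x) / cell_size)
        grid[row][col] = 1
    return grid, (min_x, min_z)
```

A grid like this could be drawn as an image and handed to a planner together with candidate paths. A real system would also need to track free space and the agent pose, which this sketch leaves out.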

If this is right

  • The approach extends reliable navigation to longer trajectories where end-to-end models typically lose coherence.
  • Spatial enhancements in both systems improve grounding for planning compared with pure zero-shot MLLM pipelines.
  • Coordination between the systems reduces overall reasoning time while preserving generalization to new scenes.
  • Ablation results indicate that removing either the global-local awareness or the 3D map module lowers final performance.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Real-time map updates in System 2 could let the agent recover from temporary obstacles by replanning without restarting the entire task.
  • The same fast-slow split might transfer to other language-guided embodied tasks such as object search or rearrangement in homes.
  • Rendered path images could be varied at inference time to let the planner preview alternate routes before committing to a waypoint.

Load-bearing premise

The fast action system and the slow planning system can coordinate without producing conflicts or deadlocks when the agent faces environments it has not seen during training.

What would settle it

Deploy the agent in a long-horizon unseen test environment and measure whether the success rate falls below current single-system baselines or whether the agent frequently stalls while the two systems resolve differing suggestions.
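
The test above can be sketched as a small evaluation helper that reports success rate alongside a stall rate. This is illustrative only: the `Episode` record, the `has_stall` heuristic (net displacement below a threshold over a fixed window of steps), and the default thresholds are assumptions, not measurements or definitions taken from the paper.

```python
import math
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class Episode:
    """One navigation run (hypothetical record format)."""
    success: bool
    positions: List[Tuple[float, float]]  # agent (x, z) in metres, one per step


def has_stall(
    positions: List[Tuple[float, float]],
    window: int = 10,
    min_progress: float = 0.2,
) -> bool:
    """True if net displacement over any `window` consecutive steps is
    below `min_progress` metres (an assumed definition of a stall)."""
    for i in range(len(positions) - window):
        if math.dist(positions[i], positions[i + window]) < min_progress:
            return True
    return False


def summarize(
    episodes: List[Episode],
    window: int = 10,
    min_progress: float = 0.2,
) -> Dict[str, float]:
    """Fraction of successful episodes and fraction containing a stall."""
    if not episodes:
        raise ValueError("need at least one episode")
    n = len(episodes)
    stalled = sum(has_stall(e.positions, window, min_progress) for e in episodes)
    return {
        "success_rate": sum(e.success for e in episodes) / n,
        "stall_rate": stalled / n,
    }
```

Comparing these two numbers against a single-system baseline on the same long-horizon split would show whether coordination overhead appears as stalls even when the success rate holds up.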

Figures

Figures reproduced from arXiv: 2605.17249 by Hailong Huang, Haoang Li, Haoyang Yang, Jingzhi Huang, Junkai Huang, Wenxuan Song, Yi Wang.

Figure 1: Overview of SEDualVLN. System 1 (orange) generates fast, low-level actions from RGB …
Figure 2: Framework of System 1. Given input RGB image streams and language instructions, the …
Figure 3: System 2 Workflow. It consists of three stages: Mapping, which builds a 3D map and 2D …
Figure 4: Comparative experiment. SEDualVLN did not take a wrong turn at the beginning like …
Figure 5: Case study. Navigation and mapping visualization of our SEDualVLN.
Figure 6: The specific details of the MLLM reasoning are presented in the Case study.
Figure 7: The specific details of the MLLM reasoning are presented in the Case study.
Original abstract

Vision-Language Navigation (VLN) approaches have currently followed two primary paradigms: the end-to-end Vision-Language Model (VLM) policy fine-tuned on navigation trajectories to directly predict actions, and the zero-shot modular pipeline integrating pre-trained Multimodal Large Language Model (MLLM) for training-free generalization to unseen environments. However, end-to-end methods struggle with long-horizon navigation and lack dynamic reasoning, whereas zero-shot methods are constrained by limited spatial grounding for reliable planning and also require substantial reasoning time. To bridge this gap, we introduce SEDualVLN, a spatially-enhanced dual-system VLN framework. System 1 is a VLM model enhanced with both global and local spatial awareness, used for action generation. System 2 integrates a general MLLM with a mapping module, wherein the MLLM plans waypoints by leveraging top-down views of the real-time 3D map alongside streams of rendered path images. Both systems leverage different forms of spatial enhancement to cultivate the agent's sense of direction in VLN tasks. Ultimately, they cooperate to complete the navigation task through a fast-slow coordinated approach. SEDualVLN achieves state-of-the-art performance on VLN-CE benchmarks, and further ablation studies demonstrate the effectiveness of each system and module.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The manuscript introduces SEDualVLN, a dual-system framework for Vision-Language Navigation that pairs a fast VLM-based System 1 (enhanced with global and local spatial awareness for direct action generation) with a slower System 2 (MLLM plus real-time 3D mapping module that plans waypoints from top-down views and rendered path images). The two systems cooperate via a fast-slow coordinated approach to address limitations of pure end-to-end and zero-shot methods, claiming state-of-the-art results on VLN-CE benchmarks together with ablation studies validating each component.

Significance. If the empirical claims hold, the work offers a practical bridge between reactive end-to-end policies and modular planning, potentially improving long-horizon reliability in unseen environments through explicit spatial enhancements. The dual-system design and emphasis on cultivating directional awareness constitute a clear incremental contribution to VLN-CE.

major comments (1)
  1. [Abstract and dual-system cooperation description] The central claim of reliable navigation rests on the fast-slow coordination between System 1 action generation and System 2 waypoint planning, yet the manuscript provides only a high-level description of their cooperation. No priority rules, override conditions, deadlock detection mechanism, or fusion procedure for reconciling incompatible proposals (e.g., when rendered path images and 3D map updates disagree) are specified. This omission directly affects the weakest assumption identified for unseen long-horizon episodes.
minor comments (1)
  1. [Abstract] Quantitative results, error bars, and exact VLN-CE dataset splits should be stated explicitly in the abstract or early results section to allow immediate verification of the SOTA claim.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the thoughtful and constructive feedback on our manuscript. We have carefully addressed the major comment concerning the dual-system cooperation mechanism and revised the paper to provide greater clarity and detail on this aspect.

Point-by-point responses
  1. Referee: [Abstract and dual-system cooperation description] The central claim of reliable navigation rests on the fast-slow coordination between System 1 action generation and System 2 waypoint planning, yet the manuscript provides only a high-level description of their cooperation. No priority rules, override conditions, deadlock detection mechanism, or fusion procedure for reconciling incompatible proposals (e.g., when rendered path images and 3D map updates disagree) are specified. This omission directly affects the weakest assumption identified for unseen long-horizon episodes.

    Authors: We agree that the original manuscript described the fast-slow coordination at a high level, which limited the transparency of how the systems interact in practice. In the revised version, we have added a new subsection (Section 3.4) that explicitly details the coordination protocol. System 1 serves as the default reactive controller for low-latency action generation. System 2 intervenes at fixed intervals or upon detecting map inconsistencies (e.g., via rendered path image mismatches with the 3D map). Priority rules assign precedence to System 2 for waypoint overrides when long-horizon discrepancies exceed a confidence threshold from the MLLM. A simple deadlock detector monitors consecutive failed actions from System 1 and triggers a System 2 replan. The fusion procedure reconciles proposals by selecting the System 2 waypoint if the rendered path deviates beyond a spatial threshold, otherwise blending compatible actions. These additions directly strengthen the description for long-horizon unseen episodes. revision: yes
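
The protocol in this simulated response can be sketched as a small arbitration function. It is a reading of the rebuttal text, not the paper's implementation: the function names, the check order (deadlock, then deviation, then interval), and every default threshold are assumptions. The "blending compatible actions" step is left out because the text does not say how blending would work.

```python
from typing import Tuple


def update_failures(consecutive_failures: int, action_succeeded: bool) -> int:
    """Deadlock counter: reset on success, otherwise count one more failure."""
    return 0 if action_succeeded else consecutive_failures + 1


def arbitrate(
    step: int,
    consecutive_failures: int,
    path_deviation_m: float,
    replan_every: int = 20,              # assumed fixed replanning interval
    max_failures: int = 3,               # assumed deadlock threshold
    deviation_threshold_m: float = 1.0,  # assumed spatial threshold
) -> Tuple[str, str]:
    """Decide which system drives the next step.

    Returns (controller, reason). System 1 is the default reactive
    controller; System 2 takes over on deadlock, on large deviation
    from the planned path, or at fixed intervals.
    """
    if consecutive_failures >= max_failures:
        return ("system2", "deadlock")
    if path_deviation_m > deviation_threshold_m:
        return ("system2", "deviation")
    if step > 0 and step % replan_every == 0:
        return ("system2", "interval")
    return ("system1", "default")
```

Checking deadlock first means a stuck agent always triggers a replan, even when it is still close to the last planned path. A different ordering would be just as consistent with the text above.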

Circularity Check

0 steps flagged

No circularity: architectural proposal without equations or self-referential derivations

Full rationale

The paper introduces SEDualVLN as a dual-system architecture (System 1 VLM with spatial awareness for fast actions; System 2 MLLM with 3D mapping for slower waypoints) that cooperates via an unspecified fast-slow approach to achieve SOTA on VLN-CE. No equations, fitted parameters, uniqueness theorems, or self-citations appear in the provided text as load-bearing elements of any derivation. The central claims rest on empirical benchmark results and ablation studies rather than reducing to quantities defined by the authors' own prior constructs or by construction. This is a standard non-circular architectural contribution.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The central claim rests on the empirical effectiveness of the described spatial modules and their coordination. No explicit free parameters, axioms, or invented entities are stated in the abstract; the work is an architectural proposal evaluated on standard VLN-CE benchmarks.

pith-pipeline@v0.9.0 · 5775 in / 1118 out tokens · 46523 ms · 2026-05-20T13:27:03.359585+00:00 · methodology


