NavOne: One-Step Global Planning for Vision-Language Navigation on Top-Down Maps

Chenxi Zheng; Dijia Zhan; Jie Tang; Jinyi Li; Shaoyu Huang; Xuemiao Xu; Yong Li

arxiv: 2605.06317 · v3 · pith:GHNNX26Gnew · submitted 2026-05-07 · 💻 cs.CV · cs.AI


Dijia Zhan, Jinyi Li, Chenxi Zheng, Shaoyu Huang, Yong Li, Jie Tang, Xuemiao Xu

Pith reviewed 2026-05-20 23:09 UTC · model grok-4.3


keywords vision-language navigation · top-down maps · global path planning · one-step navigation · multi-modal fusion · R2R-TopDown dataset · dense path prediction


The pith

NavOne turns vision-language navigation into one-step global planning on pre-built top-down maps.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Traditional VLN methods accumulate errors through repeated egocentric steps and incur high computational cost. The paper reframes the problem as TD-VLN, a single global planning task that uses pre-built top-down maps to reason over the entire environment at once. NavOne implements this by fusing multi-modal map layers and outputting dense path probabilities through one end-to-end network pass. The resulting system reports state-of-the-art results on the R2R-TopDown dataset together with large reductions in planning time. A sympathetic reader cares because the shift promises to replace incremental error-prone decisions with direct, efficient global path selection.

Core claim

By reformulating vision-language navigation as Top-Down VLN on pre-built maps, NavOne directly predicts dense path probabilities over multi-modal maps in a single end-to-end forward pass. The framework uses a Top-Down Map Fuser to create a joint representation and extends Attention Residuals to enable spatial-aware depth mixing. Experiments on the newly constructed R2R-TopDown dataset show that this one-step method reaches state-of-the-art performance among map-based VLN approaches while delivering an 8x planning-stage speedup over existing map-based baselines and an 80x speedup over egocentric methods.

What carries the argument

The NavOne network that fuses multi-modal top-down maps and predicts dense path probabilities in a single forward pass.
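
The summary above does not say how a single trajectory is read off the dense probability map. As a rough illustration only, one plausible post-processing step is a shortest-path search in which entering a cell costs the negative log of its predicted probability, restricted to navigable cells. The function name `decode_path`, the grid layout, and the 4-connected neighborhood are assumptions made for this sketch, not details taken from the paper.

```python
import heapq
import math


def decode_path(prob, navigable, start, goal, eps=1e-6):
    """Return a minimum-cost 4-connected path from start to goal, or None.

    Hypothetical post-processing, not the paper's decoder.
    prob: 2D list of per-cell path probabilities in [0, 1].
    navigable: 2D list of booleans (True = free space).
    Entering a cell costs -log(p), so high-probability cells are cheap.
    """
    rows, cols = len(prob), len(prob[0])
    dist = {start: 0.0}
    parent = {}
    heap = [(0.0, start)]
    while heap:
        d, cell = heapq.heappop(heap)
        if cell == goal:
            break
        if d > dist.get(cell, math.inf):
            continue  # stale heap entry
        r, c = cell
        for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            if 0 <= nr < rows and 0 <= nc < cols and navigable[nr][nc]:
                nd = d - math.log(max(prob[nr][nc], eps))
                if nd < dist.get((nr, nc), math.inf):
                    dist[(nr, nc)] = nd
                    parent[(nr, nc)] = cell
                    heapq.heappush(heap, (nd, (nr, nc)))
    if goal not in dist:
        return None  # goal unreachable through navigable cells
    path = [goal]
    while path[-1] != start:
        path.append(parent[path[-1]])
    return path[::-1]
```

Because the search runs once over a fixed grid, this kind of decoding keeps the planning stage non-iterative with respect to the network: the model is queried a single time and the path is extracted afterwards.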

If this is right

Navigation decisions shift from incremental local steps to a single global plan, reducing cumulative error.
Continuous spatial reasoning over the full map replaces discrete path-proposal bottlenecks.
Planning time drops by a factor of eight relative to prior map-based methods.
The same model delivers an eighty-fold speedup compared with egocentric step-by-step baselines.
Global navigation becomes feasible in larger environments where repeated local decisions become intractable.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

If reliable top-down maps can be acquired on the fly, the same one-pass architecture could support online adaptation without retraining.
The dense probability output could be reused as a prior for uncertainty-aware planning or for guiding low-level controllers.
The method's reliance on map fusion suggests straightforward extension to additional sensor modalities such as semantic labels or occupancy grids.

Load-bearing premise

Accurate pre-built top-down maps of the environment are available and do not need to be constructed or corrected during navigation.

What would settle it

Measuring whether NavOne retains its reported accuracy and speed advantage when forced to operate without pre-existing maps or with maps that contain significant localization errors.
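
A robustness check along these lines could start from synthetic map corruption. The sketch below is a minimal, hypothetical perturbation routine (the name `corrupt_occupancy` and its parameters are illustrative and do not come from the paper): it shifts an occupancy grid by an integer offset to imitate registration error and randomly flips navigable cells to obstacles to imitate missing regions.

```python
import random


def corrupt_occupancy(grid, shift=(0, 0), dropout=0.0, seed=0):
    """Return a corrupted copy of a 2D occupancy grid (1 = navigable, 0 = obstacle).

    Illustrative perturbation only; the parameters are assumptions.
    shift: (dr, dc) integer translation imitating a pose/registration offset;
           cells shifted in from outside the map become obstacles.
    dropout: probability that a navigable cell is flipped to obstacle,
             imitating missing or unobserved regions.
    """
    rng = random.Random(seed)
    rows, cols = len(grid), len(grid[0])
    dr, dc = shift
    out = [[0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            sr, sc = r - dr, c - dc  # source cell before the shift
            if 0 <= sr < rows and 0 <= sc < cols:
                out[r][c] = grid[sr][sc]
            if out[r][c] == 1 and rng.random() < dropout:
                out[r][c] = 0
    return out
```

Sweeping `shift` and `dropout` over a small range and re-running evaluation on the corrupted maps would give a degradation curve rather than a single clean-map number.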

Figures

Figures reproduced from arXiv: 2605.06317 by Chenxi Zheng, Dijia Zhan, Jie Tang, Jinyi Li, Shaoyu Huang, Xuemiao Xu, Yong Li.

**Figure 1.** Overview of NavOne. Given a language instruction and multi-modal top-down map inputs ...

**Figure 2.** Examples of multi-modal map inputs from R2R-TopDown. From left to right: RGB map, occupancy map (white=navigable, black=obstacle), semantic map (color-coded categories), and ground truth trajectory

**Figure 4.** Overview of our NavOne architecture. Multi-modal maps (RGB, occupancy, semantic) ...

**Figure 5.** Qualitative result: (a) predicted path (red) and ground truth (green) on the RGB map, (b) goal probability map, (c) path probability map. We present a representative success case in ...

**Figure 6.** Multi-room navigation example

**Figure 8.** Kitchen navigation example

**Figure 10.** Real-robot corridor navigation example 1

Original abstract

Existing Vision-Language Navigation (VLN) methods typically adopt an egocentric, step-by-step paradigm, which struggles with error accumulation and limits efficiency. While recent approaches attempt to leverage pre-built environment maps, they often rely on incrementally updating memory graphs or scoring discrete path proposals, which restricts continuous spatial reasoning and creates discrete bottlenecks. We propose Top-Down VLN (TD-VLN), reformulating navigation as a one-step global path planning problem on pre-built top-down maps, supported by our newly constructed R2R-TopDown dataset. To solve this, we introduce NavOne, a unified framework that directly predicts dense path probabilities over multi-modal maps in a single end-to-end forward pass. NavOne features a Top-Down Map Fuser for joint multi-modal map representation, and extends Attention Residuals for spatial-aware depth mixing. Extensive experiments on R2R-TopDown show that NavOne achieves state-of-the-art performance among map-based VLN methods, with a planning-stage speedup of 8x over existing map-based baselines and 80x over egocentric methods, enabling highly efficient global navigation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

NavOne reframes VLN as single-pass dense path prediction on pre-built top-down maps plus a new dataset, which is a clean shift from incremental methods, but the speed and SOTA claims rest on perfect map inputs that lack robustness checks.


The main thing to know is that this paper turns vision-language navigation into a one-step global planning task on complete top-down maps instead of the usual step-by-step egocentric route. They support it with the R2R-TopDown dataset and NavOne, which fuses multi-modal maps and predicts dense path probabilities in one forward pass using a Top-Down Map Fuser and extended Attention Residuals for spatial depth handling.

Referee Report

2 major / 1 minor

Summary. The paper reformulates Vision-Language Navigation as Top-Down VLN (TD-VLN), a one-step global path planning task on pre-built top-down maps, and introduces the NavOne framework. NavOne fuses multi-modal maps via a Top-Down Map Fuser, extends Attention Residuals for spatial depth mixing, and directly predicts dense path probabilities in a single end-to-end forward pass. It is evaluated on a newly constructed R2R-TopDown dataset and claims state-of-the-art results among map-based VLN methods plus planning speedups of 8x over map-based baselines and 80x over egocentric methods.

Significance. If the quantitative results hold, the shift to dense one-step global planning on top-down maps addresses error accumulation and discrete bottlenecks in prior VLN work, offering a clear efficiency gain. The new R2R-TopDown dataset and the unified multi-modal fusion architecture are constructive contributions that could support further research on map-based navigation.

major comments (2)

[Method description and Experiments section] The central SOTA and speedup claims rest on the assumption of accurate, complete pre-built top-down maps supplied as input. No experiments evaluate NavOne or the baselines under realistic map imperfections (pose noise, missing regions, dynamic obstacles), which is load-bearing because the single forward-pass dense prediction cannot incrementally recover from construction errors the way graph-updating methods can.
[Abstract] The abstract asserts specific quantitative gains (SOTA among map-based methods, 8x and 80x planning speedups) yet supplies no metrics, tables, baseline details, or error bars; verification of these claims therefore cannot be performed from the provided summary.

minor comments (1)

[Method] Notation for the multi-modal map representation and the exact form of the dense path probability output could be clarified with an explicit equation or diagram in the method section.

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for the constructive feedback and positive assessment of the TD-VLN reformulation and NavOne framework. We address each major comment below with clarifications and revisions to the manuscript.


Referee: [Method description and Experiments section] The central SOTA and speedup claims rest on the assumption of accurate, complete pre-built top-down maps supplied as input. No experiments evaluate NavOne or the baselines under realistic map imperfections (pose noise, missing regions, dynamic obstacles), which is load-bearing because the single forward-pass dense prediction cannot incrementally recover from construction errors the way graph-updating methods can.

Authors: The TD-VLN task is defined in Section 3.1 as one-step global planning given a pre-built top-down map; this assumption is core to the problem reformulation and enables the efficiency gains of the single forward pass. We agree that robustness to map imperfections is a relevant practical concern and that our current evaluation does not include such tests. In the revised manuscript we have added a dedicated paragraph in the Discussion section that explicitly acknowledges this limitation, contrasts the single-pass design with incremental graph methods, and outlines future directions involving uncertainty-aware map fusion. We have not added new experiments with injected noise or dynamic obstacles, as these would require a substantially extended evaluation protocol beyond the scope of the current contribution.
Revision: partial
Referee: [Abstract] The abstract asserts specific quantitative gains (SOTA among map-based methods, 8x and 80x planning speedups) yet supplies no metrics, tables, baseline details, or error bars; verification of these claims therefore cannot be performed from the provided summary.

Authors: Abstracts are concise summaries and conventionally omit detailed tables, error bars, and baseline specifications. The full manuscript reports all supporting metrics in Section 4 (Tables 1–3), including success rate, SPL, navigation error, and planning-time measurements with standard deviations; the reported 8x and 80x speedups are derived directly from the average planning times in Table 3. To improve verifiability from the abstract alone, we have added a short parenthetical reference to the primary metrics and the main comparison table.
Revision: yes
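
For readers unfamiliar with the metrics named in the response, the sketch below computes SPL under its standard definition (success weighted by the ratio of shortest-path length to the longer of the shortest and actual path lengths, averaged over episodes) and a speedup as a ratio of average planning times. The function names, episode tuples, and timings are made-up placeholders for illustration, not values from Tables 1–3.

```python
def spl(episodes):
    """Success weighted by Path Length, averaged over episodes.

    episodes: list of (success, shortest_len, taken_len) tuples, where
    success is a bool and both lengths are positive and in the same unit.
    """
    total = 0.0
    for success, shortest, taken in episodes:
        if success:
            # A successful episode scores 1.0 only if it took the shortest path.
            total += shortest / max(taken, shortest)
    return total / len(episodes)


def speedup(baseline_time, method_time):
    """Ratio of average planning times (same unit for both arguments)."""
    return baseline_time / method_time


# Hypothetical numbers, chosen only to exercise the functions.
demo_episodes = [(True, 10.0, 10.0), (True, 10.0, 20.0), (False, 10.0, 5.0)]
demo_spl = spl(demo_episodes)          # (1.0 + 0.5 + 0.0) / 3
demo_speedup = speedup(4.0, 0.5)       # baseline 4.0 s vs. method 0.5 s
```

Note that a planning-stage speedup computed this way says nothing about map construction time, which is outside the planning stage by definition.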

standing simulated objections not resolved

No experiments evaluate NavOne or the baselines under realistic map imperfections (pose noise, missing regions, dynamic obstacles)

Circularity Check

0 steps flagged

No circularity detected; derivation is self-contained

full rationale

The paper reformulates VLN as TD-VLN on pre-built top-down maps and introduces NavOne as a new end-to-end neural architecture with Top-Down Map Fuser and Attention Residuals for direct dense path probability prediction. This is evaluated on the authors' newly constructed R2R-TopDown dataset. No derivation step reduces by construction to fitted inputs, self-citations, or renamed prior results; the central claims rest on empirical performance of the proposed model rather than tautological redefinitions or load-bearing self-references. The framework is presented as independent of the baselines it outperforms.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract is available, so no explicit free parameters, axioms, or invented entities can be extracted. The method implicitly relies on standard neural network training, but the training details are not given here.

pith-pipeline@v0.9.0 · 5741 in / 1285 out tokens · 40043 ms · 2026-05-20T23:09:52.769529+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean reality_from_one_distinction unclear

Relation between the paper passage and the cited Recognition theorem.

NavOne ... directly predicts dense path probabilities over multi-modal maps in a single end-to-end forward pass

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

