pith. machine review for the scientific record.

arxiv: 2605.06317 · v2 · submitted 2026-05-07 · 💻 cs.CV · cs.AI

Recognition: no theorem link

NavOne: One-Step Global Planning for Vision-Language Navigation on Top-Down Maps

Authors on Pith: no claims yet

Pith reviewed 2026-05-11 00:44 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords Vision-Language Navigation · Top-Down Maps · Global Path Planning · Dense Probability Prediction · Multi-Modal Fusion · Efficient Navigation

The pith

NavOne reformulates vision-language navigation as one-step global path planning via direct dense path probability prediction on pre-built top-down maps.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tries to establish that vision-language navigation can be solved more efficiently and accurately by treating it as a single global planning step on pre-built top-down maps rather than as a sequence of incremental egocentric steps. This matters because current methods suffer from error accumulation over long action sequences and are computationally slow due to repeated local decisions. NavOne implements the reformulation by fusing map modalities and predicting the full path probability field in one forward pass. If correct, it allows navigation agents to plan complete routes in a single forward pass from pre-built map information.

Core claim

NavOne is a unified framework that directly predicts dense path probabilities over multi-modal maps in a single end-to-end forward pass. Supported by the Top-Down VLN reformulation and the R2R-TopDown dataset, it uses a Top-Down Map Fuser for joint representation and extends Attention Residuals for spatial-aware depth mixing. This yields state-of-the-art performance among map-based VLN methods along with major speed improvements.
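
For intuition, here is a minimal sketch of what a top-down map fuser and a spatial-aware attention-residual layer could look like. This is not the paper's implementation: the channel counts, the per-token gating, and the reading of "depth mixing" as residual reuse of earlier attention maps are all assumptions.

```python
# Illustrative sketch only; NavOne's actual Top-Down Map Fuser and Attention
# Residuals extension are not specified in the material above. All sizes,
# the gating scheme, and the depth-mixing interpretation are assumptions.
import torch
import torch.nn as nn


class TopDownMapFuser(nn.Module):
    """Fuse RGB, occupancy, and semantic top-down maps into one feature grid."""

    def __init__(self, sem_classes: int = 40, dim: int = 128):
        super().__init__()
        self.rgb_enc = nn.Conv2d(3, dim, kernel_size=3, padding=1)
        self.occ_enc = nn.Conv2d(1, dim, kernel_size=3, padding=1)
        self.sem_enc = nn.Conv2d(sem_classes, dim, kernel_size=3, padding=1)
        self.mix = nn.Conv2d(3 * dim, dim, kernel_size=1)  # joint representation

    def forward(self, rgb, occ, sem):
        feats = torch.cat([self.rgb_enc(rgb), self.occ_enc(occ), self.sem_enc(sem)], dim=1)
        return self.mix(feats)  # (B, dim, H, W)


class SpatialAttentionResidualLayer(nn.Module):
    """One plausible reading of spatial-aware depth mixing: the previous layer's
    attention weights are residually mixed into the current layer's weights,
    gated per spatial token."""

    def __init__(self, dim: int = 128, heads: int = 4):
        super().__init__()
        self.heads = heads
        self.qkv = nn.Linear(dim, 3 * dim)
        self.proj = nn.Linear(dim, dim)
        self.gate = nn.Linear(dim, 1)   # spatial-aware mixing coefficient
        self.norm = nn.LayerNorm(dim)

    def forward(self, x, prev_attn=None):
        # x: (B, N, dim) flattened map tokens; prev_attn: (B, heads, N, N) or None.
        B, N, D = x.shape
        q, k, v = self.qkv(self.norm(x)).chunk(3, dim=-1)
        q = q.view(B, N, self.heads, -1).transpose(1, 2)
        k = k.view(B, N, self.heads, -1).transpose(1, 2)
        v = v.view(B, N, self.heads, -1).transpose(1, 2)
        attn = torch.softmax(q @ k.transpose(-2, -1) / (D // self.heads) ** 0.5, dim=-1)
        if prev_attn is not None:
            alpha = torch.sigmoid(self.gate(x)).transpose(1, 2).unsqueeze(-1)  # (B, 1, N, 1)
            attn = (1 - alpha) * attn + alpha * prev_attn                       # depth mixing
        out = (attn @ v).transpose(1, 2).reshape(B, N, D)
        return x + self.proj(out), attn


# Usage: carry attention maps across depth so later layers can reuse them.
fuser = TopDownMapFuser()
layer1, layer2 = SpatialAttentionResidualLayer(), SpatialAttentionResidualLayer()
grid = fuser(torch.rand(1, 3, 32, 32), torch.rand(1, 1, 32, 32), torch.rand(1, 40, 32, 32))
tokens = grid.flatten(2).transpose(1, 2)   # (B, H*W, dim)
tokens, attn = layer1(tokens)
tokens, attn = layer2(tokens, prev_attn=attn)
```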

What carries the argument

The one-step dense path probability prediction on fused top-down maps, which replaces discrete or incremental approaches with continuous global reasoning in a single pass.
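
A toy illustration of that single pass, assuming nothing about NavOne's actual modules: stacked map channels conditioned on an instruction embedding, followed by a dense per-cell head. The FiLM-style conditioning, GRU encoder, and 1x1 head are placeholders, not the paper's design.

```python
# Toy sketch of one-step dense path probability prediction; every module
# choice here is an assumption, not NavOne's architecture.
import torch
import torch.nn as nn


class OneStepPathPredictor(nn.Module):
    def __init__(self, vocab: int = 1000, dim: int = 64, sem_classes: int = 40):
        super().__init__()
        self.map_enc = nn.Conv2d(3 + 1 + sem_classes, dim, kernel_size=3, padding=1)
        self.embed = nn.Embedding(vocab, dim)
        self.gru = nn.GRU(dim, dim, batch_first=True)
        self.film = nn.Linear(dim, 2 * dim)           # language conditioning (assumed)
        self.head = nn.Conv2d(dim, 1, kernel_size=1)  # dense path logits

    def forward(self, rgb, occ, sem, tokens):
        maps = self.map_enc(torch.cat([rgb, occ, sem], dim=1))         # (B, dim, H, W)
        _, h = self.gru(self.embed(tokens))                            # h: (1, B, dim)
        gamma, beta = self.film(h[-1]).chunk(2, dim=-1)                # (B, dim) each
        maps = gamma[..., None, None] * maps + beta[..., None, None]   # condition on language
        return torch.sigmoid(self.head(maps)).squeeze(1)               # (B, H, W) probabilities


# A single forward pass yields the full path probability field for an episode.
model = OneStepPathPredictor()
rgb, occ, sem = torch.rand(1, 3, 96, 96), torch.rand(1, 1, 96, 96), torch.rand(1, 40, 96, 96)
path_prob = model(rgb, occ, sem, torch.randint(0, 1000, (1, 12)))
print(path_prob.shape)  # torch.Size([1, 96, 96])
```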

If this is right

  • NavOne achieves state-of-the-art performance among map-based VLN methods.
  • It delivers a planning-stage speedup of 8x over existing map-based baselines.
  • It provides an 80x speedup over egocentric methods.
  • This enables highly efficient global navigation without the error accumulation of step-by-step paradigms.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • This approach may scale better to long-horizon tasks by avoiding cumulative local errors.
  • It suggests that pre-built maps can serve as a foundation for other vision-based planning problems in robotics.
  • Future systems could integrate real-time map updates to handle changing environments while retaining the one-step prediction benefit.

Load-bearing premise

The central claim depends on having accurate pre-built top-down maps available and on dense path probability prediction being sufficient to guide successful navigation.
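
For concreteness, one standard way to turn a dense path probability field into an executable waypoint sequence (the material above does not say how NavOne extracts paths): treat each cell's negative log probability as a traversal cost and run Dijkstra between start and goal.

```python
# Sketch of path extraction from a dense probability field; this is a generic
# technique, not necessarily the extraction step NavOne uses.
import heapq
import numpy as np


def extract_path(path_prob, start, goal, eps=1e-6):
    """path_prob: (H, W) array in [0, 1]; start/goal: (row, col) cells."""
    H, W = path_prob.shape
    cost = -np.log(np.clip(path_prob, eps, 1.0))   # high probability -> low cost
    dist = np.full((H, W), np.inf)
    prev = {}
    dist[start] = 0.0
    heap = [(0.0, start)]
    while heap:
        d, (r, c) = heapq.heappop(heap)
        if (r, c) == goal:
            break
        if d > dist[r, c]:
            continue
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if 0 <= nr < H and 0 <= nc < W and d + cost[nr, nc] < dist[nr, nc]:
                dist[nr, nc] = d + cost[nr, nc]
                prev[(nr, nc)] = (r, c)
                heapq.heappush(heap, (dist[nr, nc], (nr, nc)))
    path, node = [goal], goal                      # walk back to recover waypoints
    while node != start:
        node = prev[node]
        path.append(node)
    return path[::-1]
```

If no high-probability corridor connects start to goal, the recovered path wanders through low-probability cells, which is exactly the failure mode the premise above has to rule out.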

What would settle it

Deploy NavOne in a setting with inaccurate or missing top-down maps and measure whether its success rate drops below that of traditional step-by-step VLN methods.
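
A sketch of how such a test could be instrumented. The episode format and the run_navone_episode / run_stepwise_episode callables are hypothetical stand-ins for the real systems; only the map-corruption and aggregation logic is spelled out.

```python
# Hypothetical harness for the map-degradation test; the two run_* callables
# stand in for NavOne and a step-by-step baseline and are not real APIs.
import numpy as np


def corrupt_occupancy(occ, flip_rate, rng):
    """Flip a random fraction of cells to simulate an inaccurate pre-built map."""
    mask = rng.random(occ.shape) < flip_rate
    return np.where(mask, 1 - occ, occ)


def compare(episodes, flip_rates, run_navone_episode, run_stepwise_episode, seed=0):
    rng = np.random.default_rng(seed)
    results = {}
    for rate in flip_rates:
        navone, stepwise = [], []
        for ep in episodes:
            noisy_map = corrupt_occupancy(ep["occupancy"], rate, rng)
            navone.append(run_navone_episode(ep, noisy_map))   # 1 on success, else 0
            stepwise.append(run_stepwise_episode(ep))          # baseline ignores the map
        results[rate] = {"NavOne_SR": float(np.mean(navone)),
                         "Stepwise_SR": float(np.mean(stepwise))}
    return results
```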

Figures

Figures reproduced from arXiv: 2605.06317 by Chenxi Zheng, Dijia Zhan, Jie Tang, Jinyi Li, Shaoyu Huang, Xuemiao Xu, Yong Li.

Figure 1: Overview of NavOne. Given a language instruction and multi-modal top-down map inputs
Figure 2: Examples of multi-modal map inputs from R2R-TopDown. From left to right: RGB map, occupancy map (white = navigable, black = obstacle), semantic map (color-coded categories), and ground-truth trajectory
Figure 4: Overview of our NavOne architecture. Multi-modal maps (RGB, occupancy, semantic)
Figure 5: Qualitative result: (a) predicted path (red) and ground truth (green) on the RGB map, (b) goal probability map, (c) path probability map. We present a representative success case
Figure 6: Multi-room navigation example
Figure 8: Kitchen navigation example
Figure 10: Real-robot corridor navigation example 1
Original abstract

Existing Vision-Language Navigation (VLN) methods typically adopt an egocentric, step-by-step paradigm, which struggles with error accumulation and limits efficiency. While recent approaches attempt to leverage pre-built environment maps, they often rely on incrementally updating memory graphs or scoring discrete path proposals, which restricts continuous spatial reasoning and creates discrete bottlenecks. We propose Top-Down VLN (TD-VLN), reformulating navigation as a one-step global path planning problem on pre-built top-down maps, supported by our newly constructed R2R-TopDown dataset. To solve this, we introduce NavOne, a unified framework that directly predicts dense path probabilities over multi-modal maps in a single end-to-end forward pass. NavOne features a Top-Down Map Fuser for joint multi-modal map representation, and extends Attention Residuals for spatial-aware depth mixing. Extensive experiments on R2R-TopDown show that NavOne achieves state-of-the-art performance among map-based VLN methods, with a planning-stage speedup of 8x over existing map-based baselines and 80x over egocentric methods, enabling highly efficient global navigation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated author's rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper introduces NavOne, a framework for Vision-Language Navigation (VLN) that reformulates the task as one-step global path planning via dense path probability prediction on pre-built top-down maps. It presents the Top-Down Map Fuser for multi-modal map integration, extends Attention Residuals for spatial depth mixing, and releases the R2R-TopDown dataset. Experiments claim state-of-the-art results among map-based VLN methods along with 8x planning speedup over map-based baselines and 80x over egocentric approaches.

Significance. If the empirical claims hold, the work provides a concrete alternative to incremental egocentric VLN pipelines by enabling single-pass global reasoning, which could improve both accuracy on long trajectories and real-time efficiency in embodied settings. The new dataset and the unified one-step architecture are constructive contributions that future map-based methods can build upon.

major comments (1)
  1. [§4, Table 3] The reported SOTA among map-based methods and the 8x/80x speedups are load-bearing for the central claim, yet the manuscript provides limited breakdown of how path extraction from the dense probability field behaves on the longest R2R-TopDown trajectories (e.g., multi-room instructions). Without per-trajectory length-stratified metrics or failure-case analysis, it remains unclear whether the one-step formulation truly avoids the error accumulation it targets or simply shifts failure modes to map-fusion artifacts.
minor comments (2)
  1. [Abstract, §1] The abstract and §1 should explicitly state the evaluation metrics (e.g., success rate, SPL) used to declare SOTA, rather than leaving them implicit.
  2. [§3] Notation for the Top-Down Map Fuser (e.g., how multi-modal features are concatenated before the attention residuals) could be clarified with a small diagram or explicit equations in §3.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback on the experimental validation of NavOne. We address the concern regarding trajectory-level analysis below and will strengthen the manuscript accordingly.

Point-by-point responses
  1. Referee: [§4, Table 3] The reported SOTA among map-based methods and the 8x/80x speedups are load-bearing for the central claim, yet the manuscript provides limited breakdown of how path extraction from the dense probability field behaves on the longest R2R-TopDown trajectories (e.g., multi-room instructions). Without per-trajectory length-stratified metrics or failure-case analysis, it remains unclear whether the one-step formulation truly avoids the error accumulation it targets or simply shifts failure modes to map-fusion artifacts.

    Authors: We agree that aggregate metrics alone leave room for deeper validation of the one-step global planning claim. The R2R-TopDown dataset does contain trajectories spanning multiple rooms and varying lengths, and the reported SOTA results reflect performance across this distribution. However, to more explicitly demonstrate that dense probability prediction mitigates error accumulation rather than merely relocating failures to map fusion, we will add length-stratified success rates (short/medium/long trajectories) and a dedicated failure-case analysis subsection in the revised §4. This will include qualitative examples contrasting NavOne's global path outputs against incremental baselines on multi-room instructions, clarifying the contribution of the unified multi-modal fuser and attention residuals. revision: yes
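
For reference, the promised length-stratified breakdown is easy to compute from per-episode logs. The record fields and bin edges below are assumptions; SPL follows the standard Anderson et al. definition (success-weighted ratio of shortest-path length to the larger of taken and shortest path).

```python
# Sketch of length-stratified SR/SPL; field names and bin edges are assumed,
# not taken from the paper.
import numpy as np


def stratified_metrics(episodes, bins=(0.0, 8.0, 12.0, np.inf)):
    """episodes: dicts with success (0/1), agent_path_len, gt_path_len (meters)."""
    out = {}
    for lo, hi, name in zip(bins[:-1], bins[1:], ("short", "medium", "long")):
        group = [e for e in episodes if lo <= e["gt_path_len"] < hi]
        if not group:
            continue
        sr = np.mean([e["success"] for e in group])
        spl = np.mean([e["success"] * e["gt_path_len"] /
                       max(e["agent_path_len"], e["gt_path_len"]) for e in group])
        out[name] = {"n": len(group), "SR": float(sr), "SPL": float(spl)}
    return out
```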

Circularity Check

0 steps flagged

No circularity in the derivation chain

full rationale

The paper reformulates VLN as one-step dense path probability prediction on pre-built top-down maps via the new NavOne framework (Top-Down Map Fuser + Attention Residuals) and supports it with a newly constructed R2R-TopDown dataset plus empirical SOTA and speedup results. No load-bearing step reduces by construction to its inputs: there are no self-definitional loops, fitted parameters renamed as predictions, uniqueness theorems imported from the same authors, or ansatzes smuggled via self-citation. The central claims rest on the architectural design and experimental validation rather than tautological re-labeling of inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the assumption that global planning on top-down maps can replace step-by-step navigation effectively. No free parameters or invented entities are mentioned in the abstract.

axioms (1)
  • domain assumption: Pre-built top-down maps of the environment are available.
    The method is designed for use with pre-built maps, as stated in the TD-VLN proposal.

pith-pipeline@v0.9.0 · 5510 in / 1327 out tokens · 66756 ms · 2026-05-11T00:44:32.395108+00:00 · methodology


    D. Fox, W. Burgard, and S. Thrun. The dynamic window approach to collision avoidance.IEEE robotics & automation magazine, 4(1):23–33, 2002. 11 A R2R-TopDown Dataset Construction Details This appendix provides detailed procedures for constructing the R2R-TopDown dataset from the original R2R-CE benchmark. We transform egocentric visual observations into mu...