Progress-Think: Semantic Progress Reasoning for Vision-Language Navigation

Shuo Wang, Yucheng Wang, Guoxin Lian, Yongcai Wang, Maiyue Chen, Kaihui Wang, Bo Zhang, Zhizhong Su, Yutian Zhou, Wanting Li, Deying Li, Zhaoxin Fan

arXiv: 2511.17097 · v2 · submitted 2025-11-21 · 💻 cs.RO

Pith reviewed 2026-05-17 20:59 UTC · model grok-4.3

classification 💻 cs.RO

keywords Vision-Language Navigation · Semantic Progress Reasoning · Progress-Guided Policy · Navigation Advancement · Vision-Language-Action Models · R2R-CE · RxR-CE

The pith

Semantic progress reasoning from visual observations produces a more consistent sense of advancement in vision-language navigation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper claims that agents fail on long instruction sequences because they lack a reliable way to know how far they have come. It proposes predicting progress in the same style as the original instructions, using the fact that visual observations and instruction steps advance together monotonically. A three-stage process first pretrains a module to align visual history with instruction prefixes, then uses those progress states to shape the navigation policy, and finally tunes both together with progress-aware rewards. If correct, this yields higher success and fewer wasted steps on standard benchmarks without needing manual progress labels. The core idea is that language-style progress descriptions give the agent a clearer internal map of its position in the overall task than numeric scores or direct action prediction alone.

Core claim

By predicting instruction-style progress descriptions directly from visual history, the model exploits the monotonic co-progression of observations and instructions to create a more consistent internal representation of how far the agent has advanced through a multi-step navigation command.

What carries the argument

Semantic progress reasoning module that generates progress statements matching instruction prefixes from current visual history, trained via differentiable alignment and then injected into the policy.

If this is right

Navigation agents maintain better alignment with the remaining instruction over long horizons.
The method reaches state-of-the-art success and efficiency on R2R-CE and RxR-CE.
Progress states can be learned without expensive manual annotations through self-aligned pretraining.
Joint optimization of the progress module and policy produces mutually reinforcing improvements.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same progress-prediction approach might transfer to other long-horizon tasks that pair visual streams with language goals.
If the monotonic alignment holds under real-world sensor noise, explicit metric localization could become less necessary.
Different instruction phrasings might require testing to check whether the learned progress representations remain stable.

Load-bearing premise

Visual observations and instruction sequences always advance together in a monotonic way with no major mismatches or reversals.

What would settle it

Train the same navigation backbone with and without the semantic progress module on R2R-CE. If success rate or path efficiency shows no gain, or drops, the claim does not hold.
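A minimal sketch of how such a with/without comparison could be scored, assuming each episode is logged with a success flag, the shortest-path length, and the length of the path the agent actually took. The `Episode` record, the function names, and the toy numbers are illustrative assumptions and do not come from the paper. SPL here follows the standard definition: the mean over episodes of S_i · l_i / max(p_i, l_i).

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Episode:
    """One evaluation episode (hypothetical record, not the paper's format)."""
    success: bool          # did the agent stop within the success radius?
    shortest_path: float   # shortest start-to-goal distance, assumed > 0
    agent_path: float      # length of the path the agent actually travelled


def success_rate(episodes: Sequence[Episode]) -> float:
    """Fraction of episodes that ended in success."""
    return sum(e.success for e in episodes) / len(episodes)


def spl(episodes: Sequence[Episode]) -> float:
    """Success weighted by path length, averaged over all episodes."""
    total = 0.0
    for e in episodes:
        if e.success:
            total += e.shortest_path / max(e.agent_path, e.shortest_path)
    return total / len(episodes)


# Toy numbers for illustration only; they are not results from the paper.
with_progress = [
    Episode(True, 10.0, 10.0),
    Episode(True, 8.0, 16.0),
    Episode(False, 5.0, 9.0),
]
without_progress = [
    Episode(True, 10.0, 20.0),
    Episode(False, 8.0, 12.0),
    Episode(False, 5.0, 9.0),
]

for name, eps in [("with", with_progress), ("without", without_progress)]:
    print(f"{name:8s} SR={success_rate(eps):.3f} SPL={spl(eps):.3f}")
```

Both variants should be evaluated on the same episodes so that any difference in SR or SPL can be attributed to the progress module rather than to the episode sample.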

Figures

Figures reproduced from arXiv: 2511.17097 by Bo Zhang, Deying Li, Guoxin Lian, Kaihui Wang, Maiyue Chen, Shuo Wang, Wanting Li, Yongcai Wang, Yucheng Wang, Yutian Zhou, Zhaoxin Fan, Zhizhong Su.

**Figure 2.** Overview of the Progress-Think framework and annotation-free training pipeline. Compared with the vanilla Vision-Language …

**Figure 3.** Qualitative comparison of progress reasoning quality. Across two representative scenes, we compare how different models infer …

Original abstract

Vision-Language Navigation requires agents to act coherently over long horizons by understanding not only local visual context but also how far they have advanced within a multi-step instruction. However, recent Vision-Language-Action models focus on direct action prediction and earlier progress methods predict numeric achievements; both overlook the monotonic co-progression property of the observation and instruction sequences. Building on this insight, Progress-Think introduces semantic progress reasoning, predicting instruction-style progress from visual observations to enable more accurate navigation. To achieve this without expensive annotations, we propose a three-stage framework. In the initial stage, Self-Aligned Progress Pretraining bootstraps a reasoning module via a novel differentiable alignment between visual history and instruction prefixes. Then, Progress-Guided Policy Pretraining injects learned progress states into the navigation context, guiding the policy toward consistent actions. Finally, Progress-Policy Co-Finetuning jointly optimizes both modules with tailored progress-aware reinforcement objectives. Experiments on R2R-CE and RxR-CE show state-of-the-art success and efficiency, demonstrating that semantic progress yields a more consistent representation of navigation advancement.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Progress-Think replaces numeric progress with semantic descriptions and bootstraps them via differentiable alignment, but the monotonic co-progression assumption looks shaky for real agent trajectories.

The main takeaway is that this paper tries to improve long-horizon coherence in vision-language navigation by making the agent predict progress in natural-language style rather than as a number. They do this with a three-stage pipeline that starts by aligning visual history to instruction prefixes, then feeds the resulting states into policy pretraining, and ends with joint fine-tuning under progress-aware rewards. That specific combination of semantic output and self-aligned bootstrapping is not in the direct-action or numeric-progress baselines they mention in the abstract.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes Progress-Think, a three-stage framework for Vision-Language Navigation that introduces semantic progress reasoning. Stage 1 performs Self-Aligned Progress Pretraining via a differentiable alignment loss between visual history and instruction prefixes; Stage 2 injects the resulting progress states into a navigation policy; Stage 3 performs joint co-finetuning with progress-aware reinforcement objectives. The central claim is that this yields a more consistent representation of navigation advancement than direct action prediction or numeric progress methods, producing state-of-the-art success and efficiency on R2R-CE and RxR-CE.

Significance. If the empirical claims are substantiated, the work offers a concrete mechanism for exploiting the monotonic co-progression property in VLN without requiring manual progress annotations. The self-aligned pretraining and progress-aware RL objectives constitute a reusable training recipe that could improve long-horizon coherence in embodied agents.

major comments (2)

[§3.1] §3.1 (Self-Aligned Progress Pretraining): the differentiable alignment loss is derived under the assumption that observation sequences and instruction prefixes co-progress monotonically. The manuscript does not demonstrate that the loss remains well-behaved or that the resulting progress states remain informative when the agent executes detours, backtracks, or enters incorrect rooms—precisely the trajectories that occur in deployed VLN policies. A concrete robustness experiment or failure-case analysis on non-monotonic rollouts is required to support the claim that semantic progress yields a more consistent representation.
[Experiments] Experiments section and Table 1: the abstract asserts state-of-the-art results on R2R-CE and RxR-CE, yet the manuscript supplies neither the precise success-rate, SPL, or efficiency numbers, nor ablations isolating the contribution of each stage, nor error analysis on trajectories where monotonicity is violated. Without these, the central empirical claim remains provisional.
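The failure-case analysis requested in the major comments needs some way to separate monotonic rollouts from those with detours or backtracks. The sketch below is one hedged possibility, assuming each rollout has already been reduced to a per-step integer progress index (for example, how many instruction steps appear completed). That reduction, the function names, and the example rollouts are assumptions for illustration and are not described in the manuscript.

```python
from typing import Dict, List, Sequence, Tuple


def reversal_steps(progress: Sequence[int]) -> List[int]:
    """Return the timesteps at which progress falls below its running maximum."""
    steps: List[int] = []
    best = None
    for t, k in enumerate(progress):
        if best is not None and k < best:
            steps.append(t)
        best = k if best is None else max(best, k)
    return steps


def split_rollouts(
    rollouts: Dict[str, Sequence[int]],
) -> Tuple[List[str], List[str]]:
    """Partition rollout names into (monotonic, non_monotonic) groups."""
    monotonic: List[str] = []
    non_monotonic: List[str] = []
    for name, progress in rollouts.items():
        if reversal_steps(progress):
            non_monotonic.append(name)
        else:
            monotonic.append(name)
    return monotonic, non_monotonic


# Hypothetical rollouts: the second one backtracks after reaching step 2.
rollouts = {
    "expert_like": [0, 1, 1, 2, 3],
    "backtrack": [0, 2, 1, 1, 3],
}
print(split_rollouts(rollouts))
```

Reporting metrics separately on the two groups would show whether any gain survives on the non-monotonic subset.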

minor comments (2)

[Abstract] Abstract: include the key quantitative metrics (success rate, SPL, etc.) and the main ablation result so that the strength of the SOTA claim is immediately visible.
[§3] Notation: define the semantic progress state representation (e.g., token sequence, embedding, or discrete label) explicitly before it is used in the policy context.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and indicate planned revisions to strengthen the manuscript.

Referee: [§3.1] §3.1 (Self-Aligned Progress Pretraining): the differentiable alignment loss is derived under the assumption that observation sequences and instruction prefixes co-progress monotonically. The manuscript does not demonstrate that the loss remains well-behaved or that the resulting progress states remain informative when the agent executes detours, backtracks, or enters incorrect rooms—precisely the trajectories that occur in deployed VLN policies. A concrete robustness experiment or failure-case analysis on non-monotonic rollouts is required to support the claim that semantic progress yields a more consistent representation.

Authors: We agree that robustness to non-monotonic trajectories is a key consideration for deployed policies. The pretraining stage uses expert demonstrations that satisfy monotonic co-progression, but we recognize the need to evaluate behavior under detours and backtracks. In the revised manuscript we will add a dedicated robustness subsection with experiments on rollouts containing induced detours, backtracks, and incorrect-room entries. We will report alignment loss values, progress-state informativeness, and comparisons to numeric progress baselines on these cases to demonstrate that semantic progress remains more consistent than alternatives. revision: partial
Referee: [Experiments] Experiments section and Table 1: the abstract asserts state-of-the-art results on R2R-CE and RxR-CE, yet the manuscript supplies neither the precise success-rate, SPL, or efficiency numbers, nor ablations isolating the contribution of each stage, nor error analysis on trajectories where monotonicity is violated. Without these, the central empirical claim remains provisional.

Authors: The referee correctly notes that the current version does not present the precise numerical results or stage-wise ablations explicitly in the main text. We will revise the Experiments section and Table 1 to report the exact success rates, SPL, and efficiency metrics on R2R-CE and RxR-CE. We will also add comprehensive ablations isolating the contribution of each of the three stages and include an error analysis on trajectories that violate monotonicity, such as those with backtracking or incorrect rooms. revision: yes

Circularity Check

0 steps flagged

Staged pretraining draws on external data; no reduction of target metric to fitted internal parameter

The paper's central derivation proceeds via a three-stage pipeline: differentiable alignment in Self-Aligned Progress Pretraining bootstraps a progress module from visual history and instruction prefixes, followed by injection into policy pretraining and joint co-finetuning with progress-aware RL objectives. This chain relies on the stated monotonic co-progression assumption and external visual-instruction data rather than defining the progress prediction as a direct function of the final navigation success metric or fitting it to a subset of the evaluation trajectories. No equation or self-citation is shown to force the claimed consistency benefit by construction; the alignment loss is presented as a novel mechanism whose validity is tested on R2R-CE and RxR-CE benchmarks. The assumption of monotonicity is an explicit modeling choice whose violation would degrade performance, but it does not render the overall result tautological.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The approach rests on one domain assumption about sequence alignment and introduces semantic progress states as a new representational device without external falsification mentioned.

axioms (1)

domain assumption: The observation and instruction sequences exhibit a monotonic co-progression property.
Explicitly invoked as the foundational insight enabling semantic progress reasoning.

invented entities (1)

semantic progress states (no independent evidence)
purpose: Represent navigation advancement in natural-language instruction format to condition the policy.
New representational construct introduced to replace numeric progress scores.

pith-pipeline@v0.9.0 · 5525 in / 1172 out tokens · 65540 ms · 2026-05-17T20:59:19.480204+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/ArithmeticFromLogic.lean embed_strictMono_of_one_lt echoes

ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.

Since progress should evolve monotonically with the visual observation sequence, an earlier timestep should correspond to a prefix of a later one... $\mathcal{L}_{\text{mono}} = \mathbb{E}\left[\max\left(0,\, k_{t_i} - k_{t_j}\right)\right]$
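The quoted monotonicity term can be read as a pairwise hinge penalty on progress positions. The following is a minimal sketch under stated assumptions: `k[t]` is taken to be a scalar progress position at timestep `t` (for example, an expected instruction-prefix index), and the expectation is taken as a plain average over all pairs with t_i < t_j. The passage does not say how pairs are sampled, and the function name is hypothetical.

```python
from itertools import combinations
from typing import Sequence


def monotonic_hinge_loss(k: Sequence[float]) -> float:
    """Average of max(0, k[i] - k[j]) over all timestep pairs with i < j."""
    pairs = list(combinations(range(len(k)), 2))  # each pair is (i, j), i < j
    if not pairs:
        return 0.0
    return sum(max(0.0, k[i] - k[j]) for i, j in pairs) / len(pairs)


# Non-decreasing progress incurs no penalty.
print(monotonic_hinge_loss([0.0, 1.0, 2.0, 3.0]))
# A single reversal (2.0 followed by 1.0) is penalised.
print(monotonic_hinge_loss([0.0, 2.0, 1.0, 3.0]))
```

This pure-Python version only illustrates the arithmetic. For training, the same expression would need to be written in an automatic-differentiation framework so that gradients flow through `k`.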
IndisputableMonolith/Cost/FunctionalEquation.lean Jcost_pos_of_ne_one echoes

ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.

Progress-Length Reward... $r_{\text{len}} = 1$ if $|\hat{I}_t| \le |I|$, else $-\beta\,(|\hat{I}_t| - |I|)$
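The quoted length reward is a simple piecewise rule, sketched below. Two things are assumptions made for illustration: lengths are measured as whitespace-separated word counts, and `beta` defaults to 0.1. The passage specifies neither the length unit nor the value of β, and the function name is hypothetical.

```python
def progress_length_reward(predicted: str, instruction: str, beta: float = 0.1) -> float:
    """Return 1 if the predicted progress text is no longer than the full
    instruction, otherwise a penalty proportional to the excess length."""
    pred_len = len(predicted.split())      # assumed length unit: words
    instr_len = len(instruction.split())
    if pred_len <= instr_len:
        return 1.0
    return -beta * (pred_len - instr_len)


instruction = "walk past the sofa and stop at the door"

# A prefix-like prediction stays within the instruction length.
print(progress_length_reward("walk past the sofa", instruction))
# A prediction that overshoots the instruction by two words is penalised.
print(progress_length_reward(instruction + " then wait", instruction, beta=0.5))
```

The rule gives a flat reward to any prediction that fits within the instruction and only discourages progress statements that run longer than the instruction they are meant to be a prefix of.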

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

SpaAct: Spatially-Activated Transition Learning with Curriculum Adaptation for Vision-Language Navigation
cs.CV 2026-04 unverdicted novelty 6.0

SpaAct activates spatial awareness in VLMs using action retrospection, future frame prediction, and progressive curriculum learning to reach SOTA on VLN-CE benchmarks.

[1] [1]

GPT-4 Technical Report

Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ah- mad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report.arXiv preprint arXiv:2303.08774,

work page internal anchor Pith review Pith/arXiv arXiv

[2] [2]

Bevbert: Multimodal map pre-training for language-guided navigation.arXiv preprint arXiv:2212.04385, 2022

Dong An, Yuankai Qi, Yangguang Li, Yan Huang, Liang Wang, Tieniu Tan, and Jing Shao. Bevbert: Multimodal map pre-training for language-guided navigation.arXiv preprint arXiv:2212.04385, 2022. 6

work page arXiv 2022

[3] [3]

Etpnav: Evolving topo- logical planning for vision-language navigation in continu- ous environments.IEEE Transactions on Pattern Analysis and Machine Intelligence, 2024

Dong An, Hanqing Wang, Wenguan Wang, Zun Wang, Yan Huang, Keji He, and Liang Wang. Etpnav: Evolving topo- logical planning for vision-language navigation in continu- ous environments.IEEE Transactions on Pattern Analysis and Machine Intelligence, 2024. 6

work page 2024

[4] [4]

Vision-and-language navigation: In- terpreting visually-grounded navigation instructions in real environments

Peter Anderson, Qi Wu, Damien Teney, Jake Bruce, Mark Johnson, Niko S ¨underhauf, Ian Reid, Stephen Gould, and Anton Van Den Hengel. Vision-and-language navigation: In- terpreting visually-grounded navigation instructions in real environments. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 3674–3683,

work page

[5] [5]

Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2. 5-vl technical report.arXiv preprint arXiv:2502.13923, 2025. 2

work page internal anchor Pith review Pith/arXiv arXiv 2025

[6] Peihao Chen, Dongyu Ji, Kunyang Lin, Runhao Zeng, Thomas Li, Mingkui Tan, and Chuang Gan. Weakly-supervised multi-granularity map learning for vision-and-language navigation. Advances in Neural Information Processing Systems, 35:38149–38161, 2022.

[7] Peihao Chen, Xinyu Sun, Hongyan Zhi, Runhao Zeng, Thomas H Li, Gaowen Liu, Mingkui Tan, and Chuang Gan. A2Nav: Action-aware zero-shot robot navigation by exploiting vision-and-language ability of foundation models. arXiv preprint arXiv:2308.07997, 2023.

[8] An-Chieh Cheng, Yandong Ji, Zhaojing Yang, Zaitian Gongye, Xueyan Zou, Jan Kautz, Erdem Bıyık, Hongxu Yin, Sifei Liu, and Xiaolong Wang. NaVILA: Legged robot vision-language-action model for navigation. arXiv preprint arXiv:2412.04453, 2024.

[9] Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, Marcel Blistein, Ori Ram, Dan Zhang, Evan Rosen, et al. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities. arXiv preprint arXiv:2507.06261, 2025.

[10] Georgios Georgakis, Karl Schmeckpeper, Karan Wanchoo, Soham Dan, Eleni Miltsakaki, Dan Roth, and Kostas Daniilidis. Cross-modal map learning for vision and language navigation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 15460–15470, 2022.

[11] Jing Gu, Eliana Stefani, Qi Wu, Jesse Thomason, and Xin Eric Wang. Vision-and-language navigation: A survey of tasks, methods, and future directions. arXiv preprint arXiv:2203.12667, 2022.

[12] Jacob Krantz, Erik Wijmans, Arjun Majumdar, Dhruv Batra, and Stefan Lee. Beyond the nav-graph: Vision-and-language navigation in continuous environments. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXVIII 16, pages 104–. Springer, 2020.

[14] Alexander Ku, Peter Anderson, Roma Patel, Eugene Ie, and Jason Baldridge. Room-across-room: Multilingual vision-and-language navigation with dense spatiotemporal grounding. arXiv preprint arXiv:2010.07954, 2020.

[15] Haoran Liu, Weikang Wan, Xiqian Yu, Minghan Li, Jiazhao Zhang, Bo Zhao, Zhibo Chen, Zhongyuan Wang, Zhizheng Zhang, and He Wang. Navid-4d: Unleashing spatial intelligence in egocentric rgb-d videos for vision-and-language navigation.

[16] Qingxiang Liu, Ting Huang, Zeyu Zhang, and Hao Tang. Nav-r1: Reasoning and navigation in embodied scenes. arXiv preprint arXiv:2509.10884, 2025.

[17] Rui Liu, Wenguan Wang, and Yi Yang. Vision-language navigation with energy-based policy. arXiv preprint arXiv:2410.14250, 2024.

[18] Zhijian Liu, Ligeng Zhu, Baifeng Shi, Zhuoyang Zhang, Yuming Lou, Shang Yang, Haocheng Xi, Shiyi Cao, Yuxian Gu, Dacheng Li, et al. NVILA: Efficient frontier visual language models. arXiv preprint arXiv:2412.04468, 2024.

[19] Chih-Yao Ma, Zuxuan Wu, Ghassan AlRegib, Caiming Xiong, and Zsolt Kira. The regretful agent: Heuristic-aided navigation through progress estimation. In Proceedings of the IEEE/CVF conference on Computer Vision and Pattern Recognition, pages 6732–6740, 2019.

[20] Khanh Nguyen, Debadeepta Dey, Chris Brockett, and Bill Dolan. Vision-based navigation with language-based assistance via imitation learning with indirect intervention. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12527–12537, 2019.

[21] Sonia Raychaudhuri, Saim Wani, Shivansh Patel, Unnat Jain, and Angel Chang. Language-aligned waypoint (LAW) supervision for vision-and-language navigation in continuous environments. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 4018–4028, Online and Punta Cana, Dominican Republic, 2021. Association for Computational Linguistics.

[23] Stéphane Ross, Geoffrey Gordon, and Drew Bagnell. A reduction of imitation learning and structured prediction to no-regret online learning. In Proceedings of the fourteenth international conference on artificial intelligence and statistics, pages 627–635. JMLR Workshop and Conference Proceedings, 2011.

[24] Raphael Schumann, Wanrong Zhu, Weixi Feng, Tsu-Jui Fu, Stefan Riezler, and William Yang Wang. Velma: Verbalization embodiment of llm agents for vision and language navigation in street view. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 18924–18933, 2024.

[25] Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Y Wu, et al. DeepSeekMath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024.

[26] Chan Hee Song, Jihyung Kil, Tai-Yu Pan, Brian M Sadler, Wei-Lun Chao, and Yu Su. One step at a time: Long-horizon vision-and-language navigation with milestones. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 15482–15491, 2022.

[27] Hanqing Wang, Wei Liang, Luc V Gool, and Wenguan Wang. Towards versatile embodied navigation. Advances in neural information processing systems, 35:36858–36874,

[28] Shuo Wang, Yongcai Wang, Wanting Li, Xudong Cai, Yucheng Wang, Maiyue Chen, Kaihui Wang, Zhizhong Su, Deying Li, and Zhaoxin Fan. Aux-think: Exploring reasoning strategies for data-efficient vision-language navigation. Advances in Neural Information Processing Systems, 2025.

[29] Shuo Wang, Yongcai Wang, Wanting Li, Xudong Cai, Yucheng Wang, Maiyue Chen, Kaihui Wang, Zhizhong Su, Deying Li, and Zhaoxin Fan. Aux-think: Exploring reasoning strategies for data-efficient vision-language navigation. arXiv preprint arXiv:2505.11886, 2025.

[30] Shuo Wang, Yongcai Wang, Wanting Li, Yucheng Wang, Maiyue Chen, Kaihui Wang, Zhizhong Su, Xudong Cai, Yeying Jin, Deying Li, et al. Monodream: Monocular vision-language navigation with panoramic dreaming. arXiv preprint arXiv:2508.02549, 2025.

[31] Zun Wang, Jialu Li, Yicong Hong, Yi Wang, Qi Wu, Mohit Bansal, Stephen Gould, Hao Tan, and Yu Qiao. Scaling data generation in vision-and-language navigation. In Proceedings of the IEEE/CVF international conference on computer vision, pages 12009–12020, 2023.

[32] Zihan Wang, Xiangyang Li, Jiahao Yang, Yeqi Liu, and Shuqiang Jiang. Sim-to-real transfer via 3d feature fields for vision-and-language navigation. arXiv preprint arXiv:2406.09798, 2024.

[33] Zihan Wang, Seungjun Lee, and Gim Hee Lee. Dynam3d: Dynamic layered 3d tokens empower vlm for vision-and-language navigation. arXiv preprint arXiv:2505.11383,

[34] Meng Wei, Chenyang Wan, Xiqian Yu, Tai Wang, Yuqiang Yang, Xiaohan Mao, Chenming Zhu, Wenzhe Cai, Hanqing Wang, Yilun Chen, et al. Streamvln: Streaming vision-and-language navigation via slowfast context modeling. arXiv preprint arXiv:2507.05240, 2025.

[35] Qiaoyun Wu, Xiaoxi Gong, Kai Xu, Dinesh Manocha, Jingxuan Dong, and Jun Wang. Towards target-driven visual navigation in indoor scenes via generative imitation learning. IEEE Robotics and Automation Letters, 6(1):175–182, 2020.

[36] Wansen Wu, Tao Chang, Xinmeng Li, Quanjun Yin, and Yue Hu. Vision-language navigation: a survey and taxonomy. Neural Computing and Applications, 36(7):3291–3316, 2024.

[37] Xuan Yao, Junyu Gao, and Changsheng Xu. Navmorph: A self-evolving world model for vision-and-language navigation in continuous environments. arXiv preprint arXiv:2506.23468, 2025.

[38] Jiazhao Zhang, Kunyu Wang, Shaoan Wang, Minghan Li, Haoran Liu, Songlin Wei, Zhongyuan Wang, Zhizheng Zhang, and He Wang. Uni-NaVid: A video-based vision-language-action model for unifying embodied navigation tasks. arXiv preprint arXiv:2412.06224, 2024.

[39] Jiazhao Zhang, Kunyu Wang, Rongtao Xu, Gengze Zhou, Yicong Hong, Xiaomeng Fang, Qi Wu, Zhizheng Zhang, and He Wang. NaVid: Video-based VLM plans the next step for vision-and-language navigation. arXiv preprint arXiv:2402.15852, 2024.

[40] Fengda Zhu, Yi Zhu, Xiaojun Chang, and Xiaodan Liang. Vision-language navigation with self-supervised auxiliary reasoning tasks. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10012–10022, 2020.