
arxiv: 2511.17097 · v2 · submitted 2025-11-21 · 💻 cs.RO

Progress-Think: Semantic Progress Reasoning for Vision-Language Navigation

Pith reviewed 2026-05-17 20:59 UTC · model grok-4.3

classification 💻 cs.RO
keywords Vision-Language Navigation · Semantic Progress Reasoning · Progress-Guided Policy · Navigation Advancement · Vision-Language-Action Models · R2R-CE · RxR-CE

The pith

Semantic progress reasoning from visual observations produces a more consistent sense of advancement in vision-language navigation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper claims that agents fail on long instruction sequences because they lack a reliable way to know how far they have come. It proposes predicting progress in the same style as the original instructions, using the fact that visual observations and instruction steps advance together monotonically. A three-stage process first pretrains a module to align visual history with instruction prefixes, then uses those progress states to shape the navigation policy, and finally tunes both together with progress-aware rewards. If correct, this yields higher success and fewer wasted steps on standard benchmarks without needing manual progress labels. The core idea is that language-style progress descriptions give the agent a clearer internal map of its position in the overall task than numeric scores or direct action prediction alone.

Core claim

By predicting instruction-style progress descriptions directly from visual history, the model exploits the monotonic co-progression of observations and instructions to create a more consistent internal representation of how far the agent has advanced through a multi-step navigation command.

What carries the argument

Semantic progress reasoning module that generates progress statements matching instruction prefixes from current visual history, trained via differentiable alignment and then injected into the policy.
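
To make the monotonic constraint concrete, here is a minimal sketch of a hard (non-differentiable) monotonic alignment between time steps and instruction prefixes. It is not the paper's differentiable alignment, whose exact form is not given here; it is a plain dynamic-programming stand-in. The function name `monotonic_align`, the `cost` matrix, and the toy values are assumptions made for illustration.

```python
from typing import List, Sequence


def monotonic_align(cost: Sequence[Sequence[float]]) -> List[int]:
    """Return a non-decreasing prefix index for every time step.

    cost[t][k] is a hypothetical mismatch score between the visual history
    up to step t and the instruction prefix that ends at sub-instruction k.
    The path starts at prefix 0, ends at the last prefix, and at each step
    either stays on the current prefix or advances by exactly one.
    """
    T, K = len(cost), len(cost[0])
    if T < K:
        raise ValueError("need at least as many steps as prefixes")
    INF = float("inf")
    # best[t][k]: minimal total cost of steps 0..t with step t assigned to k.
    best = [[INF] * K for _ in range(T)]
    back = [[0] * K for _ in range(T)]
    best[0][0] = cost[0][0]
    for t in range(1, T):
        for k in range(K):
            stay = best[t - 1][k]
            advance = best[t - 1][k - 1] if k > 0 else INF
            if advance < stay:
                best[t][k], back[t][k] = advance + cost[t][k], k - 1
            else:
                best[t][k], back[t][k] = stay + cost[t][k], k
    # Backtrack from the final step, which must sit on the last prefix.
    path = [K - 1]
    for t in range(T - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return path


# Toy cost matrix (4 steps, 3 prefixes); the values are illustrative only.
toy_cost = [[0, 1, 1], [1, 0, 1], [1, 0, 1], [1, 1, 0]]
print(monotonic_align(toy_cost))  # [0, 1, 1, 2]
```

The stay-or-advance transition is what encodes monotonic co-progression: the assigned prefix can never move backwards. A differentiable variant would presumably replace the hard minimum with a soft one, but that is a guess about the paper's mechanism, not a description of it.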

If this is right

  • Navigation agents maintain better alignment with the remaining instruction over long horizons.
  • The method reaches state-of-the-art success and efficiency on R2R-CE and RxR-CE.
  • Progress states can be learned without expensive manual annotations through self-aligned pretraining.
  • Joint optimization of the progress module and policy produces mutually reinforcing improvements.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same progress-prediction approach might transfer to other long-horizon tasks that pair visual streams with language goals.
  • If the monotonic alignment holds under real-world sensor noise, explicit metric localization could become less necessary.
  • Whether the learned progress representations remain stable under different instruction phrasings would need to be tested.

Load-bearing premise

Visual observations and instruction sequences always advance together in a monotonic way with no major mismatches or reversals.

What would settle it

Train the same navigation backbone with and without the semantic progress module on R2R-CE. If success rate or path efficiency shows no gain, or drops, the claim fails.
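
The comparison described above needs only two per-run numbers: success rate (SR) and success weighted by path length (SPL). The sketch below computes both from per-episode records using the standard SPL definition. The `Episode` record, its field names, and the toy episodes are hypothetical; none of the values come from the paper.

```python
from dataclasses import dataclass
from typing import Iterable, Tuple


@dataclass(frozen=True)
class Episode:
    success: bool            # did the agent stop within the success radius?
    shortest_path_m: float   # geodesic start-to-goal distance
    agent_path_m: float      # length of the path the agent actually took


def success_rate_and_spl(episodes: Iterable[Episode]) -> Tuple[float, float]:
    """Return (SR, SPL) averaged over all episodes."""
    eps = list(episodes)
    if not eps:
        raise ValueError("no episodes to score")
    sr = sum(1.0 for e in eps if e.success) / len(eps)
    spl_sum = 0.0
    for e in eps:
        if not e.success:
            continue  # failed episodes contribute zero to SPL
        denom = max(e.agent_path_m, e.shortest_path_m)
        # A zero-length episode that succeeds counts as perfectly efficient.
        spl_sum += e.shortest_path_m / denom if denom > 0 else 1.0
    return sr, spl_sum / len(eps)


# Toy episodes, illustrative only.
toy_run = [
    Episode(True, 10.0, 10.0),   # optimal path
    Episode(True, 10.0, 20.0),   # success, but twice the shortest path
    Episode(False, 10.0, 5.0),   # failure
]
sr, spl = success_rate_and_spl(toy_run)
```

Scoring the with-module and without-module runs on the same episode set with this function gives the two pairs of numbers the test calls for.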

Figures

Figures reproduced from arXiv: 2511.17097 by Bo Zhang, Deying Li, Guoxin Lian, Kaihui Wang, Maiyue Chen, Shuo Wang, Wanting Li, Yongcai Wang, Yucheng Wang, Yutian Zhou, Zhaoxin Fan, Zhizhong Su.

Figure 1: Our key structural insight in VLN: visual observations …
Figure 2: Overview of the Progress-Think framework and annotation-free training pipeline. Compared with the vanilla Vision-Language …
Figure 3: Qualitative comparison of progress reasoning quality. Across two representative scenes, we compare how different models infer …
Original abstract

Vision-Language Navigation requires agents to act coherently over long horizons by understanding not only local visual context but also how far they have advanced within a multi-step instruction. However, recent Vision-Language-Action models focus on direct action prediction and earlier progress methods predict numeric achievements; both overlook the monotonic co-progression property of the observation and instruction sequences. Building on this insight, Progress-Think introduces semantic progress reasoning, predicting instruction-style progress from visual observations to enable more accurate navigation. To achieve this without expensive annotations, we propose a three-stage framework. In the initial stage, Self-Aligned Progress Pretraining bootstraps a reasoning module via a novel differentiable alignment between visual history and instruction prefixes. Then, Progress-Guided Policy Pretraining injects learned progress states into the navigation context, guiding the policy toward consistent actions. Finally, Progress-Policy Co-Finetuning jointly optimizes both modules with tailored progress-aware reinforcement objectives. Experiments on R2R-CE and RxR-CE show state-of-the-art success and efficiency, demonstrating that semantic progress yields a more consistent representation of navigation advancement.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes Progress-Think, a three-stage framework for Vision-Language Navigation that introduces semantic progress reasoning. Stage 1 performs Self-Aligned Progress Pretraining via a differentiable alignment loss between visual history and instruction prefixes; Stage 2 injects the resulting progress states into a navigation policy; Stage 3 performs joint co-finetuning with progress-aware reinforcement objectives. The central claim is that this yields a more consistent representation of navigation advancement than direct action prediction or numeric progress methods, producing state-of-the-art success and efficiency on R2R-CE and RxR-CE.

Significance. If the empirical claims are substantiated, the work offers a concrete mechanism for exploiting the monotonic co-progression property in VLN without requiring manual progress annotations. The self-aligned pretraining and progress-aware RL objectives constitute a reusable training recipe that could improve long-horizon coherence in embodied agents.

major comments (2)
  1. [§3.1] §3.1 (Self-Aligned Progress Pretraining): the differentiable alignment loss is derived under the assumption that observation sequences and instruction prefixes co-progress monotonically. The manuscript does not demonstrate that the loss remains well-behaved or that the resulting progress states remain informative when the agent executes detours, backtracks, or enters incorrect rooms—precisely the trajectories that occur in deployed VLN policies. A concrete robustness experiment or failure-case analysis on non-monotonic rollouts is required to support the claim that semantic progress yields a more consistent representation.
  2. [Experiments] Experiments section and Table 1: the abstract asserts state-of-the-art results on R2R-CE and RxR-CE, yet the manuscript supplies neither the precise success-rate, SPL, or efficiency numbers, nor ablations isolating the contribution of each stage, nor error analysis on trajectories where monotonicity is violated. Without these, the central empirical claim remains provisional.
minor comments (2)
  1. [Abstract] Abstract: include the key quantitative metrics (success rate, SPL, etc.) and the main ablation result so that the strength of the SOTA claim is immediately visible.
  2. [§3] Notation: define the semantic progress state representation (e.g., token sequence, embedding, or discrete label) explicitly before it is used in the policy context.
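
One way to isolate the non-monotonic rollouts that major comment 1 asks about is to flag every step where the predicted progress falls below its running maximum. The sketch below assumes, purely for illustration, that a progress state can be reduced to an integer sub-instruction index per step; the function name and the toy rollout are hypothetical.

```python
from typing import List, Sequence


def progress_regressions(progress: Sequence[int]) -> List[int]:
    """Return the time steps where progress drops below its running maximum.

    progress[t] is a hypothetical integer index of the latest sub-instruction
    the progress module believes is complete at step t.
    """
    flagged: List[int] = []
    running_max = None
    for t, p in enumerate(progress):
        if running_max is not None and p < running_max:
            flagged.append(t)  # a backtrack, detour, or prediction reversal
        running_max = p if running_max is None else max(running_max, p)
    return flagged


# Toy rollout, illustrative only: regressions occur at steps 3 and 5.
toy_rollout = [0, 1, 1, 0, 2, 1, 2]
print(progress_regressions(toy_rollout))  # [3, 5]
```

Rollouts with a non-empty result could then be scored separately from fully monotonic ones, which is the split the requested robustness experiment depends on.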

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and indicate planned revisions to strengthen the manuscript.

  1. Referee: [§3.1] §3.1 (Self-Aligned Progress Pretraining): the differentiable alignment loss is derived under the assumption that observation sequences and instruction prefixes co-progress monotonically. The manuscript does not demonstrate that the loss remains well-behaved or that the resulting progress states remain informative when the agent executes detours, backtracks, or enters incorrect rooms—precisely the trajectories that occur in deployed VLN policies. A concrete robustness experiment or failure-case analysis on non-monotonic rollouts is required to support the claim that semantic progress yields a more consistent representation.

    Authors: We agree that robustness to non-monotonic trajectories is a key consideration for deployed policies. The pretraining stage uses expert demonstrations that satisfy monotonic co-progression, but we recognize the need to evaluate behavior under detours and backtracks. In the revised manuscript we will add a dedicated robustness subsection with experiments on rollouts containing induced detours, backtracks, and incorrect-room entries. We will report alignment loss values, progress-state informativeness, and comparisons to numeric progress baselines on these cases to demonstrate that semantic progress remains more consistent than alternatives.
    revision: partial

  2. Referee: [Experiments] Experiments section and Table 1: the abstract asserts state-of-the-art results on R2R-CE and RxR-CE, yet the manuscript supplies neither the precise success-rate, SPL, or efficiency numbers, nor ablations isolating the contribution of each stage, nor error analysis on trajectories where monotonicity is violated. Without these, the central empirical claim remains provisional.

    Authors: The referee correctly notes that the current version does not present the precise numerical results or stage-wise ablations explicitly in the main text. We will revise the Experiments section and Table 1 to report the exact success rates, SPL, and efficiency metrics on R2R-CE and RxR-CE. We will also add comprehensive ablations isolating the contribution of each of the three stages and include an error analysis on trajectories that violate monotonicity, such as those with backtracking or incorrect rooms.
    revision: yes

Circularity Check

0 steps flagged

Staged pretraining draws on external data; no reduction of target metric to fitted internal parameter


The paper's central derivation proceeds via a three-stage pipeline: differentiable alignment in Self-Aligned Progress Pretraining bootstraps a progress module from visual history and instruction prefixes, followed by injection into policy pretraining and joint co-finetuning with progress-aware RL objectives. This chain relies on the stated monotonic co-progression assumption and external visual-instruction data rather than defining the progress prediction as a direct function of the final navigation success metric or fitting it to a subset of the evaluation trajectories. No equation or self-citation is shown to force the claimed consistency benefit by construction; the alignment loss is presented as a novel mechanism whose validity is tested on R2R-CE and RxR-CE benchmarks. The assumption of monotonicity is an explicit modeling choice whose violation would degrade performance, but it does not render the overall result tautological.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The approach rests on one domain assumption about sequence alignment and introduces semantic progress states as a new representational device without external falsification mentioned.

axioms (1)
  • domain assumption: The observation and instruction sequences exhibit a monotonic co-progression property.
    Explicitly invoked as the foundational insight enabling semantic progress reasoning.
invented entities (1)
  • semantic progress states (no independent evidence)
    purpose: Represent navigation advancement in natural-language instruction format to condition the policy.
    New representational construct introduced to replace numeric progress scores.

pith-pipeline@v0.9.0 · 5525 in / 1172 out tokens · 65540 ms · 2026-05-17T20:59:19.480204+00:00 · methodology



Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. SpaAct: Spatially-Activated Transition Learning with Curriculum Adaptation for Vision-Language Navigation

    cs.CV · 2026-04 · unverdicted · novelty 6.0

    SpaAct activates spatial awareness in VLMs using action retrospection, future frame prediction, and progressive curriculum learning to reach SOTA on VLN-CE benchmarks.
