
arxiv: 2606.30367 · v1 · pith:JSAV5T4Xnew · submitted 2026-06-29 · 💻 cs.RO

FutureNav: Unified World-Action Modeling for Vision-and-Language Navigation

Pith reviewed 2026-06-30 05:03 UTC · model grok-4.3

classification 💻 cs.RO
keywords vision-and-language navigation · VLN · world modeling · action prediction · dynamics modeling · future generation · unified framework · VLM

The pith

FutureNav jointly optimizes an action policy with dynamics and future-generation objectives to reach state-of-the-art VLN results with a 4B-scale backbone.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents FutureNav as a framework that encodes text, visual, and spatial features together and feeds them into an LLM for vision-and-language navigation. It trains four objectives at once: an action policy for choosing moves, inverse and forward dynamics to capture how states change, and a future generation objective to predict upcoming spatial states. The claim is that this unified training strengthens action decisions by building an explicit world model without slowing down inference. A reader would care if explicit future modeling turns out to be a more effective way to ground instructions across long sequences than methods that generate actions directly.

Core claim

FutureNav is a VLM-based unified world-action modeling framework that jointly encodes text, visual, and spatial features into an LLM and optimizes four objectives simultaneously—an action policy objective for navigation action prediction, inverse and forward dynamics objectives for modeling state transitions, and a future generation objective for predicting future spatial states—to strengthen action prediction while explicitly modeling the world.

What carries the argument

Simultaneous optimization of four objectives (action policy, inverse dynamics, forward dynamics, future generation) on combined text-visual-spatial features inside a single LLM.
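
As a rough illustration of what "simultaneous optimization" could look like, the sketch below folds four per-objective losses into one weighted scalar. The abstract does not say how the objectives are weighted or which loss functions are used, so the names `ObjectiveLosses` and `joint_loss` and the uniform default weights are assumptions made for illustration, not the authors' implementation.

```python
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class ObjectiveLosses:
    """Per-batch scalar losses for the four objectives named in the abstract."""
    action_policy: float
    inverse_dynamics: float
    forward_dynamics: float
    future_generation: float


# Assumption: uniform weighting. The abstract gives no weighting scheme.
DEFAULT_WEIGHTS = {
    "action_policy": 1.0,
    "inverse_dynamics": 1.0,
    "forward_dynamics": 1.0,
    "future_generation": 1.0,
}


def joint_loss(losses: ObjectiveLosses,
               weights: Optional[Mapping[str, float]] = None) -> float:
    """Return the weighted sum of the four objective losses."""
    w = DEFAULT_WEIGHTS if weights is None else weights
    return sum(w[name] * getattr(losses, name) for name in DEFAULT_WEIGHTS)
```

Under this reading the dynamics and future-generation terms enter only the training loss, which would be consistent with the claim that inference speed is unchanged, although the abstract does not confirm that the auxiliary heads are dropped at inference time.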

If this is right

  • Action prediction improves by explicitly modeling state transitions across sequences.
  • State-of-the-art results are reached on multiple VLN benchmarks using only a 4B-scale backbone.
  • Prior VLN methods are outperformed while inference speed remains unchanged.
  • The architecture supports future development of world-action models for navigation.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Learned predictive world models may improve robustness on longer or more complex trajectories.
  • The same joint objective structure could transfer to other embodied tasks that require planning over time.
  • Public release of the code would allow direct tests of whether the four objectives interact without trade-offs.
  • Scaling the unified approach might further lower the backbone size needed for strong navigation performance.

Load-bearing premise

Jointly optimizing the four objectives produces better action prediction than separate training or prior VLN methods without hidden trade-offs in generalization.

What would settle it

An ablation experiment in which removing any one of the four objectives yields equal or better benchmark scores than the full joint model.
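
The leave-one-out check described above can be sketched as follows. This is a minimal illustration, assuming a single higher-is-better benchmark score per training run; the helper names `leave_one_out_configs` and `objectives_not_needed` are hypothetical, and no scores from the paper are used.

```python
OBJECTIVES = (
    "action_policy",
    "inverse_dynamics",
    "forward_dynamics",
    "future_generation",
)


def leave_one_out_configs():
    """Map each removed objective to the tuple of objectives that remain."""
    return {
        removed: tuple(o for o in OBJECTIVES if o != removed)
        for removed in OBJECTIVES
    }


def objectives_not_needed(full_score, ablated_scores):
    """List removed objectives whose ablated run scores >= the full model.

    `ablated_scores` maps the name of the removed objective to the score of
    the model trained without it. Assumes higher scores are better.
    """
    return [name for name, score in ablated_scores.items()
            if score >= full_score]
```

A non-empty result from `objectives_not_needed` would count against the load-bearing premise; an empty result across benchmarks would support it. A real comparison would also need to account for run-to-run variance, which this sketch ignores.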

Original abstract

Vision-and-language navigation (VLN) in continuous environments requires an agent to ground instructions in egocentric observations while maintaining spatial understanding across long action sequences. Recent navigation foundation models have shown strong progress by scaling vision-language models, but they often learn navigation primarily as direct action generation, without explicitly modeling world states or predicting their future evolution. We introduce FutureNav, a VLM-based unified world-action modeling framework for vision-and-language navigation. Specifically, FutureNav jointly encodes text, visual, and spatial features and feeds them into the LLM, and optimizes four objectives for simultaneous world and action modeling: an action policy objective for navigation action prediction, inverse and forward dynamics objectives for modeling state transitions, and a future generation objective for predicting future spatial states. This unified architecture strengthens action prediction while explicitly modeling the world, without sacrificing inference speed. Extensive experiments show that, with only a 4B-scale backbone, FutureNav achieves state-of-the-art performance on multiple VLN benchmarks and substantially outperforms prior VLN methods, paving the way toward future world-action models for VLN. We will release the code and models to support future research.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The paper introduces FutureNav, a VLM-based framework for vision-and-language navigation in continuous environments. It jointly encodes text, visual, and spatial features into an LLM and optimizes four objectives simultaneously: action policy for navigation, inverse dynamics, forward dynamics, and future generation for spatial state prediction. The central claim is that this unified world-action modeling yields state-of-the-art performance on multiple VLN benchmarks using only a 4B-scale backbone, outperforming prior VLN methods while preserving inference speed.

Significance. If the empirical results hold with proper controls, the work would demonstrate that explicit joint optimization of dynamics and future prediction objectives can strengthen action prediction in VLN without added inference cost, potentially shifting the field toward integrated world-action foundation models. The release of code and models would further support reproducibility.

major comments (2)
  1. [Abstract] The claim that FutureNav 'achieves state-of-the-art performance on multiple VLN benchmarks and substantially outperforms prior VLN methods' with a 4B backbone is not accompanied by quantitative results, baseline tables, ablation results, or error analysis. This absence makes it impossible to verify whether the gains are attributable to the four-objective joint training rather than to backbone scale, data, or architecture alone.
  2. [Abstract] The central claim requires that joint optimization of action policy + inverse/forward dynamics + future generation improves action prediction over prior methods or action-policy-only training. No controlled ablation (full model vs. reduced-objective variants) is referenced, leaving the causal contribution of the world-modeling objectives untested and the performance attribution open to alternative explanations.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on the abstract. We address the two major comments below and will revise the abstract accordingly to improve clarity and support for the claims.

Point-by-point responses
  1. Referee: [Abstract] The claim that FutureNav 'achieves state-of-the-art performance on multiple VLN benchmarks and substantially outperforms prior VLN methods' with a 4B backbone is not accompanied by quantitative results, baseline tables, ablation results, or error analysis. This absence makes it impossible to verify whether the gains are attributable to the four-objective joint training rather than to backbone scale, data, or architecture alone.

    Authors: We agree that the abstract would benefit from including key quantitative results to allow readers to assess the performance claims directly. In the revised version, we will add specific metrics (e.g., success rate improvements on the primary VLN benchmarks) and a brief comparison to prior methods while remaining within abstract length constraints. revision: yes

  2. Referee: [Abstract] The central claim requires that joint optimization of action policy + inverse/forward dynamics + future generation improves action prediction over prior methods or action-policy-only training. No controlled ablation (full model vs. reduced-objective variants) is referenced, leaving the causal contribution of the world-modeling objectives untested and the performance attribution open to alternative explanations.

    Authors: The manuscript body contains controlled ablation studies comparing the full four-objective model against reduced-objective variants to isolate the contribution of the dynamics and future-generation terms. These results are not referenced in the current abstract. We will revise the abstract to explicitly note that ablation experiments support the benefit of joint world-action modeling. revision: yes

Circularity Check

0 steps flagged

No significant circularity; empirical claims rest on external benchmarks

Full rationale

The paper is an empirical ML contribution describing a VLM-based agent trained with four joint objectives (action policy, inverse and forward dynamics, future generation). The provided text contains no mathematical derivation, equations, or parameter-fitting steps that would reduce any claimed prediction to its own inputs by construction. No load-bearing self-citations, uniqueness theorems, or smuggled ansatz appear. Performance claims are asserted via benchmark results rather than internal reductions, which satisfies the condition for a self-contained empirical result evaluated against external data.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Review performed on abstract only; no explicit free parameters, axioms, or invented entities are stated in the provided text.

pith-pipeline@v0.9.1-grok · 5757 in / 1136 out tokens · 25630 ms · 2026-06-30T05:03:34.957682+00:00 · methodology


Reference graph

Works this paper leans on

80 extracted references · 37 canonical work pages · 17 internal anchors

  1. [1]

    Vision-and- language navigation: Interpreting visually-grounded navigation instructions in real environments

    Peter Anderson, Qi Wu, Damien Teney, Jake Bruce, Mark Johnson, Niko Sünderhauf, Ian Reid, Stephen Gould, and Anton Van Den Hengel. Vision-and- language navigation: Interpreting visually-grounded navigation instructions in real environments. InPro- ceedings of the IEEE conference on computer vision and pattern recognition, pages 3674–3683, 2018

  2. [2]

    Beyond the nav-graph: Vision-and-language navigation in continuous envi- ronments

    Jacob Krantz, Erik Wijmans, Arjun Majumdar, Dhruv Batra, and Stefan Lee. Beyond the nav-graph: Vision-and-language navigation in continuous envi- ronments. InEuropean Conference on Computer Vision, pages 104–120. Springer, 2020

  3. [3]

    Room-across-room: Multilingual vision-and-language navigation with dense spatiotemporal grounding.arXiv preprint arXiv:2010.07954, 2020

    Alexander Ku, Peter Anderson, Roma Patel, Eu- gene Ie, and Jason Baldridge. Room-across-room: Multilingual vision-and-language navigation with dense spatiotemporal grounding.arXiv preprint arXiv:2010.07954, 2020

  4. [4]

    Waypoint mod- els for instruction-guided navigation in continuous environments

    Jacob Krantz, Aaron Gokaslan, Dhruv Batra, Stefan Lee, and Oleksandr Maksymets. Waypoint mod- els for instruction-guided navigation in continuous environments. InProceedings of the IEEE/CVF In- ternational Conference on Computer Vision, pages 15162–15171, 2021

  5. [5]

    Bridging the gap between learning in discrete andcontinuousenvironmentsforvision-and-language navigation

    Yicong Hong, Zun Wang, Qi Wu, and Stephen Gould. Bridging the gap between learning in discrete andcontinuousenvironmentsforvision-and-language navigation. InProceedings of the IEEE/CVF con- ference on computer vision and pattern recognition, pages 15439–15449, 2022

  6. [6]

    Learning navigational visual representations with semantic map supervision

    Yicong Hong, Yang Zhou, Ruiyi Zhang, Franck Der- noncourt, Trung Bui, Stephen Gould, and Hao Tan. Learning navigational visual representations with semantic map supervision. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 3055–3067, 2023

  7. [7]

    Gridmm: Grid memory map for vision-and-language navigation

    Zihan Wang, Xiangyang Li, Jiahao Yang, Yeqi Liu, and Shuqiang Jiang. Gridmm: Grid memory map for vision-and-language navigation. InProceedings of the IEEE/CVF International conference on computer vision, pages 15625–15636, 2023

  8. [8]

    Dreamwalker: Mental planning for continuous vision-language navigation

    Hanqing Wang, Wei Liang, Luc Van Gool, and Wen- guan Wang. Dreamwalker: Mental planning for continuous vision-language navigation. InProceed- ings of the IEEE/CVF international conference on computer vision, pages 10873–10883, 2023

  9. [9]

    Mapnav: A novel memory representation via annotated semantic maps for vlm-based vision-and- language navigation

    Lingfeng Zhang, Xiaoshuai Hao, Qinwen Xu, Qiang Zhang, Xinyao Zhang, Pengwei Wang, Jing Zhang, Zhongyuan Wang, Shanghang Zhang, and Renjing Xu. Mapnav: A novel memory representation via annotated semantic maps for vlm-based vision-and- language navigation. InProceedings of the 63rd Annual Meeting of the Association for Computational 10 Linguistics (Volume...

  10. [10]

    Toponav: Topological graphs as a key enabler for advanced object navigation

    Peng Liu, Qiang Zhang, Di Peng, Lingfeng Zhang, Yiran Qin, Huan Zhou, Jiajun Ma, Renjing Xu, and Yandong Ji. Toponav: Topological graphs as a key enabler for advanced object navigation. InIEEE In- ternational Conference on Robotics and Automation, 2026

  11. [11]

    Trihelper: Zero-shot object navigation with dynamic assistance

    Lingfeng Zhang, Qiang Zhang, Hao Wang, Erjia Xiao, Zixuan Jiang, Honglei Chen, and Renjing Xu. Trihelper: Zero-shot object navigation with dynamic assistance. InIEEE/RSJ International Conference on Intelligent Robots and Systems, 2024. URLhttps: //arxiv.org/abs/2403.15223

  12. [12]

    Multi-floor zero-shot object navigation policy

    Lingfeng Zhang, Hanqing Wang, Enyu Xiao, Xinyao Zhang, Qiang Zhang, Zihan Jiang, and Renjing Xu. Multi-floor zero-shot object navigation policy. In IEEE International Conference on Robotics and Au- tomation, 2025

  13. [13]

    Stairway to success: An online floor- awarezero-shotobject-goalnavigationframeworkvia llm-driven coarse-to-fine exploration.IEEE Robotics and Automation Letters, 2026

    Zeying Gong, Rui Li, Tianyu Hu, Ruofei Qiu, Linghe Kong, Lingfeng Zhang, Guoqing Zhao, Yu Ding, and Junwei Liang. Stairway to success: An online floor- awarezero-shotobject-goalnavigationframeworkvia llm-driven coarse-to-fine exploration.IEEE Robotics and Automation Letters, 2026

  14. [14]

    Socialnav-map: Dynamic mapping with human tra- jectory prediction for zero-shot social navigation

    Lingfeng Zhang, Erjia Xiao, Xiaoshuai Hao, Haox- iang Fu, Zeying Gong, Long Chen, Xiaojun Liang, Renjing Xu, Hangjun Ye, and Wenbo Ding. Socialnav-map: Dynamic mapping with human tra- jectory prediction for zero-shot social navigation. arXiv preprint arXiv:2511.12232, 2025. URLhttps: //arxiv.org/abs/2511.12232

  15. [15]

    Walk With Me: Long-Horizon Social Navigation for Human-Centric Outdoor Assistance

    Lingfeng Zhang, Xiaoshuai Hao, Xinyu Bu, Yingbo Tang, Haoran Li, Jian Lu, Xinyu Wei, Jiajun Ma, Yichen Liu, Jing Zhang, et al. Walk with me: Long- horizon social navigation for human-centric outdoor assistance.arXiv preprint arXiv:2604.26839, 2026. URLhttps://arxiv.org/abs/2604.26839

  16. [16]

    The robosense challenge: Sense anything, navigate anywhere, adapt across platforms.arXiv preprint, 2026

    Linghe Kong, Sicheng Xie, Zeying Gong, Yuxiang Li, Min Chu, An Liang, Yuhang Dong, Tianyu Hu, Ruofei Qiu, Rui Li, et al. The robosense challenge: Sense anything, navigate anywhere, adapt across platforms.arXiv preprint, 2026

  17. [17]

    Team xiaomi ev-ad vla: Caption- guided retrieval system for cross-modal drone nav- igation – technical report for iros 2025 robosense challenge track 4

    Lingfeng Zhang, Enyu Xiao, Yifan Zhang, Haoxiang Fu, Rui Hu, Yuchen Ma, Wenbo Ding, Long Chen, Hang Ye, et al. Team xiaomi ev-ad vla: Caption- guided retrieval system for cross-modal drone nav- igation – technical report for iros 2025 robosense challenge track 4. Technical report, IROS 2025 Ro- boSense Challenge, 2025

  18. [18]

    Learning to navigate socially through proactive risk perception – technical report for iros 2025 robosense challenge social navigation track

    Enyu Xiao, Lingfeng Zhang, Yingbo Tang, Haoran Cheng, Renjing Xu, Wenbo Ding, Li Zhou, Long Chen, Hang Ye, et al. Learning to navigate socially through proactive risk perception – technical report for iros 2025 robosense challenge social navigation track. Technical report, IROS 2025 RoboSense Chal- lenge, 2025

  19. [19]

    Mapfusion: A novel bev feature fusion network for multi-modal map construc- tion.Information Fusion, 119:103018, 2025

    Xiaoshuai Hao, Yunfeng Diao, Mengchuan Wei, Yi- fan Yang, Peng Hao, Rong Yin, Hui Zhang, Weiming Li, Shu Zhao, and Yu Liu. Mapfusion: A novel bev feature fusion network for multi-modal map construc- tion.Information Fusion, 119:103018, 2025

  20. [20]

    Synergistic prompting for comple- mentarity and consistency in incomplete multi-view clustering.IEEE Transactions on Image Processing, 2026

    Xiaoshuai Hao, Zhihui Zhang, Yingbo Tang, Lingfeng Zhang, Peng Hao, Yunfeng Diao, Guangyin Jin, and Yu Liu. Synergistic prompting for comple- mentarity and consistency in incomplete multi-view clustering.IEEE Transactions on Image Processing, 2026

  21. [21]

    Embodied spatial affordance: spatial-aware affordance learning for embodied nav- igation and manipulation.IEEE Transactions on Image Processing, 2026

    Xiaoshuai Hao, Yingbo Tang, Lingfeng Zhang, Long Chen, Wei Zhou, Jungong Han, Wenbo Ding, and Xiao-Ping Zhang. Embodied spatial affordance: spatial-aware affordance learning for embodied nav- igation and manipulation.IEEE Transactions on Image Processing, 2026

  22. [22]

    Embodiedplan- 1k: A benchmark for complex navigation- manipulation task planning

    Lingfeng Zhang, Yingbo Tang, Xinyu Zheng, Liang Li, Jinglin Xu, and Xiaoshuai Hao. Embodiedplan- 1k: A benchmark for complex navigation- manipulation task planning. 2026

  23. [23]

    NaVid: Video-based VLM Plans the Next Step for Vision-and-Language Navigation

    Jiazhao Zhang, Kunyu Wang, Rongtao Xu, Gengze Zhou, Yicong Hong, Xiaomeng Fang, Qi Wu, Zhizheng Zhang, and He Wang. Navid: Video-based vlm plans the next step for vision-and-language nav- igation.arXiv preprint arXiv:2402.15852, 2024

  24. [24]

    Uni-NaVid: A Video-based Vision-Language-Action Model for Unifying Embodied Navigation Tasks

    Jiazhao Zhang, Kunyu Wang, Shaoan Wang, Ming- han Li, Haoran Liu, Songlin Wei, Zhongyuan Wang, Zhizheng Zhang, and He Wang. Uni-navid: A video-based vision-language-action model for uni- fying embodied navigation tasks.arXiv preprint arXiv:2412.06224, 2024

  25. [25]

    Navila: Legged robot vision-language-action model for navigation.arXiv preprint arXiv:2412.04453, 2024

    An-Chieh Cheng, Yandong Ji, Zhaojing Yang, Zaitian Gongye, Xueyan Zou, Jan Kautz, Erdem Bıyık, Hongxu Yin, Sifei Liu, and Xiaolong Wang. Navila: Legged robot vision-language-action model for navigation.arXiv preprint arXiv:2412.04453, 2024

  26. [26]

    Streamvln: Streaming vision-and-language naviga- tion via slowfast context modeling.arXiv preprint arXiv:2507.05240, 2025

    Meng Wei, Chenyang Wan, Xiqian Yu, Tai Wang, Yuqiang Yang, Xiaohan Mao, Chenming Zhu, Wenzhe Cai, Hanqing Wang, Yilun Chen, et al. Streamvln: Streaming vision-and-language naviga- tion via slowfast context modeling.arXiv preprint arXiv:2507.05240, 2025

  27. [27]

    Embodied Navigation Foundation Model

    Jiazhao Zhang, Anqi Li, Yunpeng Qi, Minghan Li, Jiahang Liu, Shaoan Wang, Haoran Liu, Gengze Zhou, Yuze Wu, Xingxing Li, Yuxin Fan, Wenjun Li, Zhibo Chen, Fei Gao, Qi Wu, Zhizheng Zhang, and He Wang. Embodied Navigation Foundation Model. InInternational Conference on Learning Representations, 2026. URL https://openreview. net/forum?id=kkBOIsrCXh. 11

  28. [29]

    URLhttps://arxiv.org/abs/2508.04598

  29. [30]

    Robobrain 2.0 technical report.arXiv preprint arXiv:2507.02029, 2025

    BAAI RoboBrain Team, Mingyu Cao, Huajie Tan, Yuheng Ji, Xiansheng Chen, Minglan Lin, Zhiyu Li, Zhou Cao, Pengwei Wang, Enshen Zhou, et al. Robobrain 2.0 technical report.arXiv preprint arXiv:2507.02029, 2025. URL https://arxiv.org/ abs/2507.02029

  30. [31]

    MiMo-Embodied: X-Embodied Foundation Model Technical Report

    Xiaoshuai Hao, Lei Zhou, Zhijian Huang, Zhiwen Hou, Yingbo Tang, Lingfeng Zhang, Guang Li, Zheng Lu, Shuhuai Ren, et al. Mimo-embodied: X- embodied foundation model technical report.arXiv preprint arXiv:2511.16518, 2025. URL https:// arxiv.org/abs/2511.16518

  31. [32]

    OneVLA: A Unified Framework for Embodied Tasks

    Lingfeng Zhang, Xiaoshuai Hao, Yingbo Tang, Lei Zhou, Shuyi Zhang, Jinkun Liu, Hongsheng Li, Chen- hao Zhang, Qiang Zhang, Hangjun Ye, Xiaojun Liang, Long Chen, and Wenbo Ding. Onevla: A uni- fied framework for embodied tasks.arXiv preprint arXiv:2606.01241, 2026. URL https://arxiv.org/ abs/2606.01241

  32. [33]

    Xiaomi OneVL: One-Step Latent Reasoning and Planning with Vision-Language Explanation

    Jian Lu, Jian Guan, Zhaorui Huang, Jiacheng Li, Guoqiang Li, Linghe Kong, Yuxiang Li, Hao- ran Wang, Shiyu Xu, Yifan Luo, Fei Li, et al. Onevl: One-step latent reasoning and planning with vision-language explanation.arXiv preprint arXiv:2604.18486, 2026. URL https://arxiv.org/ abs/2604.18486

  33. [34]

    Video-cot: A com- prehensive dataset for spatiotemporal understanding of videos based on chain-of-thought

    Shilong Zhang, Xiaoshuai Hao, Yingbo Tang, Lingfeng Zhang, Pengwei Wang, Zhongyuan Wang, Huan Ma, and Shanghang Zhang. Video-cot: A com- prehensive dataset for spatiotemporal understanding of videos based on chain-of-thought. InProceedings of the ACM International Conference on Multimedia, 2025

  34. [35]

    Is your vlm sky-ready? a comprehensive spatial intelligence benchmark for uav navigation

    Lingfeng Zhang, Yifan Zhang, Haoran Li, Haoxiang Fu, Yingbo Tang, Hang Ye, Long Chen, Xiaojun Liang, Xiaoshuai Hao, et al. Is your vlm sky-ready? a comprehensive spatial intelligence benchmark for uav navigation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recog- nition, 2026

  35. [37]

    URLhttps://arxiv.org/abs/2503.09010

  36. [38]

    Roboafford: A dataset and benchmark for enhancing object and spatial affordance learning in robot manipulation

    Yingbo Tang, Lingfeng Zhang, Shilong Zhang, Yifan Zhao, and Xiaoshuai Hao. Roboafford: A dataset and benchmark for enhancing object and spatial affordance learning in robot manipulation. InPro- ceedings of the ACM International Conference on Multimedia, 2025

  37. [39]

    Roboafford++: A gener- ative ai-enhanced dataset for multimodal affordance learning in robotic manipulation and navigation

    Xiaoshuai Hao, Yingbo Tang, Lingfeng Zhang, Yuchen Ma, Yuhang Diao, Zihan Jia, Wenbo Ding, Hang Ye, and Long Chen. Roboafford++: A gener- ative ai-enhanced dataset for multimodal affordance learning in robotic manipulation and navigation. In IEEE/RSJ International Conference on Intelligent Robots and Systems Workshop on RoDGE, 2025

  38. [40]

    Evaluating gpt-4o’s embodied intelligence: A comprehensive empirical study

    Yifan Wu, Haoran Lyu, Yingbo Tang, Lingfeng Zhang, Ziheng Zhang, Wenxuan Zhou, and Shibo Hao. Evaluating gpt-4o’s embodied intelligence: A comprehensive empirical study. Technical report, Technical Report, 2025

  39. [41]

    Exploring typographic visual prompts injection threats in cross-modality genera- tion models

    Haoran Cheng, Enyu Xiao, Yujie Wang, Lingfeng Zhang, Kaidi Xu, Meng Sun, Xiaoshuai Hao, Jinjin Gu, and Renjing Xu. Exploring typographic visual prompts injection threats in cross-modality genera- tion models. InInternational Joint Conference on Artificial Intelligence Workshop on Deepfake Detec- tion, Localization and Interpretability, 2025

  40. [42]

    Vquala 2025 challenge on engagement prediction for short videos: Methods and results

    Dong Li, Shuang Ma, Hang Hua, Wei Li, Jian Wang, Chengwei Zhou, Feng Guan, Xin Li, Zhi Yu, Yao Lu, et al. Vquala 2025 challenge on engagement prediction for short videos: Methods and results. In IEEE/CVF International Conference on Computer Vision Workshop, 2025

  41. [43]

    H2r-bm: Can leveraging human videos enhance per- formance and generalizability in robotic bimanual manipulation?Pattern Recognition, page 113637, 2026

    Xiaoshuai Hao, Haoran Lyu, Lingfeng Zhang, Ruidong Liu, Di Wu, Jing Zhang, and Long Chen. H2r-bm: Can leveraging human videos enhance per- formance and generalizability in robotic bimanual manipulation?Pattern Recognition, page 113637, 2026

  42. [44]

    What you see is what you reach: Towards spatial navigation with high-level human instructions

    Lingfeng Zhang, Haoxiang Fu, Xiaoshuai Hao, Shi- long Zhang, Qiang Zhang, Ruidong Liu, Long Chen, and Wenbo Ding. What you see is what you reach: Towards spatial navigation with high-level human instructions. InProceedings of the AAAI Conference on Artificial Intelligence, 2026

  43. [45]

    Mesh- mimic: Geometry-aware humanoid motion learning through 3d scene reconstruction.arXiv preprint arXiv:2602.15733, 2026

    Qiang Zhang, Jiajun Ma, Peng Liu, Shiyu Shi, Zhiqiang Su, Zhongyuan Wang, Jiaming Sun, Wenx- uan Cui, Jia Yu, Guang Han, et al. Mesh- mimic: Geometry-aware humanoid motion learning through 3d scene reconstruction.arXiv preprint arXiv:2602.15733, 2026. URL https://arxiv.org/ abs/2602.15733

  44. [46]

    JanusVLN: Decoupling Se- mantics and Spatiality with Dual Implicit Memory for Vision-Language Navigation

    Shuang Zeng, Dekang Qi, Xinyuan Chang, Feng Xiong, Shichao Xie, Xiaolong Wu, Shiyi Liang, Mu Xu, and Xing Wei. JanusVLN: Decoupling Se- mantics and Spatiality with Dual Implicit Memory for Vision-Language Navigation. InInternational Conference on Learning Representations, 2026. URL https://openreview.net/forum?id=RnuB0Nlbd5. 12

  45. [47]

    Navigation world models

    AmirBar, GaoyueZhou, DannyTran, TrevorDarrell, and Yann LeCun. Navigation world models. InPro- ceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 15791–15801,

  46. [48]

    URLhttps://arxiv.org/abs/2412.03572

  47. [49]

    WorldVLN: Autoregressive World Action Model for Aerial Vision-Language Navigation

    Baining Zhao, Jiacheng Xu, Weicheng Feng, Xin Zhang, Zhaolu Wang, Haoyang Wang, Shilong Ji, Ziyou Wang, Jianjie Fang, Zhiheng Zheng, Weichen Zhang, Yu Shang, Wei Wu, Chen Gao, Xinlei Chen, and Yong Li. Worldvln: Autoregressive world action model for aerial vision-language navigation.arXiv preprint arXiv:2605.15964, 2026. URL https:// arxiv.org/abs/2605.15964

  48. [50]

    WAM-Nav: Asymmetric Latent World-Action Modeling for Unified Visual Navigation

    Ning Yang, Yan Huang, Kaiwen Peng, Ziheng He, Kai Wang, Cui Miao, Kailin Lyu, Guo Li, Xiaofeng Wang, Zheng Zhu, Jing Liu, and Nianfeng Liu. Wam-nav: Asymmetric latent world-action model- ing for unified visual navigation.arXiv preprint arXiv:2606.04907, 2026. URL https://arxiv.org/ abs/2606.04907

  49. [51]

    Navforesee: A unified vision-language world model for hierarchi- cal planning and dual-horizon navigation prediction

    Fei Liu, Shichao Xie, Minghua Luo, Zedong Chu, Junjun Hu, Xiaolong Wu, and Mu Xu. Navforesee: A unified vision-language world model for hierarchi- cal planning and dual-horizon navigation prediction. arXiv preprint arXiv:2512.01550, 2025

  50. [52]

    AstraNav-World: World Model for Foresight Control and Consistency

    Junjun Hu, Jintao Chen, Haochen Bai, Minghua Luo, Shichao Xie, Ziyi Chen, Fei Liu, Zedong Chu, Xinda Xue, Botao Ren, et al. Astranav-world: World model for foresight control and consistency.arXiv preprint arXiv:2512.21714, 2025

  51. [53]

    Nav- morph: A self-evolving world model for vision-and- language navigation in continuous environments

    Xuan Yao, Junyu Gao, and Changsheng Xu. Nav- morph: A self-evolving world model for vision-and- language navigation in continuous environments. arXiv preprint arXiv:2506.23468, 2025

  52. [54]

    OA-WAM: Object-Addressable World Action Model for Robust Robot Manipulation

    Yichen Liu, Peng Sun, Shuo Li, Yuxuan Xie, Lingfeng Zhang, Xingyu Chao, Siyuan Dong, Fei Chen, Xiaoping Zhang, et al. Oa-wam: Object- addressable world action model for robust robot ma- nipulation.arXiv preprint arXiv:2605.06481, 2026. URLhttps://arxiv.org/abs/2605.06481

  53. [55]

    Thinking in Text and Images: Interleaved Vision--Language Reasoning Traces for Long-Horizon Robot Manipulation

    Jiahao Liu, Haoran Chi, Lingfeng Zhang, Yuxuan Xie, Yanan Wang, Long Chen, Hang Ye, Xiaoshuai Hao, and Wenbo Ding. Thinking in text and im- ages: Interleaved vision-language reasoning traces for long-horizon robot manipulation.arXiv preprint arXiv:2605.00438, 2026. URL https://arxiv.org/ abs/2605.00438

  54. [56]

    Reasoning emerges from constrained inference manifolds in large language models

    Yuchen Ma, Fei Luo, Lingfeng Zhang, Chen Zhao, Ming Wang, Yifan Wu, Ziyu Qian, Yao Lu, Long Chen, et al. Reasoning emerges from constrained inference manifolds in large language models.arXiv preprint arXiv:2605.08142, 2026. URL https:// arxiv.org/abs/2605.08142

  55. [57]

    Sef-map: Subspace-decomposed expert fusion for robust multimodal hd map predic- tion

    Haoxiang Fu, Lingfeng Zhang, Haoran Li, Rui Hu, Zi- han Li, Guoqing Liu, Zeyu Tan, Long Chen, Hang Ye, and Xiaoshuai Hao. Sef-map: Subspace-decomposed expert fusion for robust multimodal hd map predic- tion. InIEEE International Conference on Robotics and Automation, 2026

  56. [59]

    URLhttps://arxiv.org/abs/2604.05405

  57. [60]

    Pathdreamer: A world model for indoor navigation.arXiv preprint arXiv:2105.08756, 2021

    Jing Yu Koh, Honglak Lee, Yinfei Yang, Jason Baldridge, and Peter Anderson. Pathdreamer: A world model for indoor navigation.arXiv preprint arXiv:2105.08756, 2021. URL https://arxiv.org/ abs/2105.08756

  58. [61]

    Vistav2: World imagination for indoor vision-and-language naviga- tion.arXiv preprint arXiv:2512.00041, 2025

    Yanjia Huang, Xianshun Jiang, Xiangbo Gao, Mingyang Wu, and Zhengzhong Tu. Vistav2: World imagination for indoor vision-and-language naviga- tion.arXiv preprint arXiv:2512.00041, 2025. URL https://arxiv.org/abs/2512.00041

  59. [62]

    Navcot: Boosting llm- based vision-and-language navigation via learning disentangled reasoning.IEEE Transactions on Pat- tern Analysis and Machine Intelligence, 2025

    Bingqian Lin, Yunshuang Nie, Ziming Wei, Jiaqi Chen, Shikui Ma, Jianhua Han, Hang Xu, Xiaojun Chang, and Xiaodan Liang. Navcot: Boosting llm- based vision-and-language navigation via learning disentangled reasoning.IEEE Transactions on Pat- tern Analysis and Machine Intelligence, 2025. URL https://arxiv.org/abs/2403.07376

  60. [63]

    Monodream: Monocular vision-language nav- igation with panoramic dreaming.arXiv preprint arXiv:2508.02549, 2025

    Shuo Wang, Yongcai Wang, Wanting Li, Yucheng Wang, Maiyue Chen, Kaihui Wang, Zhizhong Su, Xudong Cai, Yeying Jin, Deying Li, and Zhaoxin Fan. Monodream: Monocular vision-language nav- igation with panoramic dreaming.arXiv preprint arXiv:2508.02549, 2025. URL https://arxiv.org/ abs/2508.02549

  61. [64]

    Fast-WAM: Do World Action Models Need Test-time Future Imagination?

    Tianyuan Yuan, Zibin Dong, Yicheng Liu, and Hang Zhao. Fast-wam: Do world action models need test-time future imagination? arXiv preprint arXiv:2603.16666, 2026

  62. [65]

    SpaAct: Spatially-Activated Transition Learning with Curriculum Adaptation for Vision-Language Navigation

    Pengna Li, Kangyi Wu, Shaoqing Xu, Fang Li, Hanbing Li, Lin Zhao, Kailin Lyu, Long Chen, Zhi-Xin Yang, and Nanning Zheng. Spaact: Spatially-activated transition learning with curriculum adaptation for vision-language navigation. arXiv preprint arXiv:2604.27620, 2026

  63. [66]

    Vggt: Visual geometry grounded transformer

    Jianyuan Wang, Minghao Chen, Nikita Karaev, Andrea Vedaldi, Christian Rupprecht, and David Novotny. Vggt: Visual geometry grounded transformer. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 5294–5306, 2025

  64. [67]

    Sim-2-sim transfer for vision-and-language navigation in continuous environments

    Jacob Krantz and Stefan Lee. Sim-2-sim transfer for vision-and-language navigation in continuous environments. In European conference on computer vision, pages 588–603. Springer, 2022.

  65. [68]

    1st place solutions for rxr-habitat vision-and-language navigation competition (cvpr 2022). arXiv preprint arXiv:2206.11610, 2022

    Dong An, Zun Wang, Yangguang Li, Yi Wang, Yicong Hong, Yan Huang, Liang Wang, and Jing Shao. 1st place solutions for rxr-habitat vision-and-language navigation competition (cvpr 2022). arXiv preprint arXiv:2206.11610, 2022

  66. [69]

    Instructnav: Zero-shot system for generic instruction navigation in unexplored environment. arXiv preprint arXiv:2406.04882, 2024

    Yuxing Long, Wenzhe Cai, Hongcheng Wang, Guanqi Zhan, and Hao Dong. Instructnav: Zero-shot system for generic instruction navigation in unexplored environment. arXiv preprint arXiv:2406.04882, 2024

  67. [70]

    Cosmo: Combination of selective memorization for low-cost vision-and-language navigation

    Siqi Zhang, Yanyuan Qiao, Qunbo Wang, Zike Yan, Qi Wu, Zhihua Wei, and Jing Liu. Cosmo: Combination of selective memorization for low-cost vision-and-language navigation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 5511–5522, 2025

  68. [71]

    Affordances-oriented planning using foundation models for continuous vision-language navigation

    Jiaqi Chen, Bingqian Lin, Xinmin Liu, Lin Ma, Xiaodan Liang, and Kwan-Yee K Wong. Affordances-oriented planning using foundation models for continuous vision-language navigation. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 39, pages 23568–23576, 2025

  69. [72]

    Language-aligned waypoint (law) supervision for vision-and-language navigation in continuous environments

    Sonia Raychaudhuri, Saim Wani, Shivansh Patel, Unnat Jain, and Angel Chang. Language-aligned waypoint (law) supervision for vision-and-language navigation in continuous environments. In Proceedings of the 2021 conference on empirical methods in natural language processing, pages 4018–4028, 2021

  70. [73]

    g3d-lf: Generalizable 3d-language feature fields for embodied tasks

    Zihan Wang and Gim Hee Lee. g3d-lf: Generalizable 3d-language feature fields for embodied tasks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14191–14202, 2025

  71. [74]

    Navid-4d: Unleashing spatial intelligence in egocentric rgb-d videos for vision-and-language navigation

    Haoran Liu, Weikang Wan, Xiqian Yu, Minghan Li, Jiazhao Zhang, Bo Zhao, Zhibo Chen, Zhongyuan Wang, Zhizheng Zhang, and He Wang. Navid-4d: Unleashing spatial intelligence in egocentric rgb-d videos for vision-and-language navigation. In 2025 IEEE International Conference on Robotics and Automation (ICRA), pages 10607–10615. IEEE, 2025

  72. [75]

    Sim-to-real transfer via 3d feature fields for vision-and-language navigation. arXiv preprint arXiv:2406.09798, 2024

    Zihan Wang, Xiangyang Li, Jiahao Yang, Yeqi Liu, and Shuqiang Jiang. Sim-to-real transfer via 3d feature fields for vision-and-language navigation. arXiv preprint arXiv:2406.09798, 2024

  73. [76]

    Ground slow, move fast: A dual-system foundation model for generalizable vision-and-language navigation. arXiv preprint arXiv:2512.08186, 2025

    Meng Wei, Chenyang Wan, Jiaqi Peng, Xiqian Yu, Yuqiang Yang, Delin Feng, Wenzhe Cai, Chenming Zhu, Tai Wang, Jiangmiao Pang, et al. Ground slow, move fast: A dual-system foundation model for generalizable vision-and-language navigation. arXiv preprint arXiv:2512.08186, 2025

  74. [77]

    Habitat: A platform for embodied ai research

    Manolis Savva, Abhishek Kadian, Oleksandr Maksymets, Yili Zhao, Erik Wijmans, Bhavana Jain, Julian Straub, Jia Liu, Vladlen Koltun, Jitendra Malik, et al. Habitat: A platform for embodied ai research. In Proceedings of the IEEE/CVF international conference on computer vision, pages 9339–9347, 2019

  75. [78]

    Matterport3D: Learning from RGB-D Data in Indoor Environments

    Angel Chang, Angela Dai, Thomas Funkhouser, Maciej Halber, Matthias Niessner, Manolis Savva, Shuran Song, Andy Zeng, and Yinda Zhang. Matterport3d: Learning from rgb-d data in indoor environments. arXiv preprint arXiv:1709.06158, 2017

  76. [79]

    Qwen3-VL Technical Report

    Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, Wenbin Ge, Zhifang Guo, Qidong Huang, Jie Huang, Fei Huang, Binyuan Hui, Shutong Jiang, Zhaohai Li, Mingsheng Li, Mei Li, Kaixin Li, Zicheng Lin, Junyang Lin, Xuejing Liu, Jiawei Liu, Chenglong Liu, Yang Liu, Dayiheng Liu, Sh...

  77. [80]

    Scaling data generation in vision-and-language navigation

    Zun Wang, Jialu Li, Yicong Hong, Yi Wang, Qi Wu, Mohit Bansal, Stephen Gould, Hao Tan, and Yu Qiao. Scaling data generation in vision-and-language navigation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 12009–12020, 2023

  78. [81]

    A reduction of imitation learning and structured prediction to no-regret online learning

    Stéphane Ross, Geoffrey Gordon, and Drew Bagnell. A reduction of imitation learning and structured prediction to no-regret online learning. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pages 627–635. JMLR Workshop and Conference Proceedings, 2011

  79. [82]

    Auto-encoding variational bayes

    Diederik P. Kingma and Max Welling. Auto-encoding variational bayes. In International Conference on Learning Representations, 2014

  80. [83]

    Emerging properties in self-supervised vision transformers

    Mathilde Caron, Hugo Touvron, Ishan Misra, Hervé Jégou, Julien Mairal, Piotr Bojanowski, and Armand Joulin. Emerging properties in self-supervised vision transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 9650–9660, 2021.