FutureNav: Unified World-Action Modeling for Vision-and-Language Navigation
Pith reviewed 2026-06-30 05:03 UTC · model grok-4.3
The pith
FutureNav jointly optimizes an action policy with dynamics and future-generation objectives to reach state-of-the-art VLN performance with a 4B-scale backbone.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
FutureNav is a VLM-based unified world-action modeling framework that jointly encodes text, visual, and spatial features into an LLM and optimizes four objectives simultaneously—an action policy objective for navigation action prediction, inverse and forward dynamics objectives for modeling state transitions, and a future generation objective for predicting future spatial states—to strengthen action prediction while explicitly modeling the world.
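As a rough illustration of what the four objectives named above typically look like, the block below writes each one in a standard textbook form. Everything in it is an assumption made for illustration: the symbols ($I$ for the instruction, $s_t$ for the encoded state, $a_t$ for the action, $k$ for the prediction horizon), the heads $\pi_\theta$, $p_\theta$, $f_\theta$, $g_\theta$, and the choice of log-likelihood versus squared-error losses are not taken from the paper, whose exact formulation is not given in the text reviewed here.

```latex
% Illustrative forms only; the symbols and loss choices below are assumptions.
% I: instruction, s_t: encoded state at step t, a_t: action at step t, k: horizon.
\begin{align}
\mathcal{L}_{\text{act}} &= -\log \pi_\theta(a_t \mid I, s_{\le t}) \\
\mathcal{L}_{\text{inv}} &= -\log p_\theta(a_t \mid s_t, s_{t+1}) \\
\mathcal{L}_{\text{fwd}} &= \lVert f_\theta(s_t, a_t) - s_{t+1} \rVert_2^2 \\
\mathcal{L}_{\text{fut}} &= \lVert g_\theta(I, s_{\le t}) - s_{t+1:t+k} \rVert_2^2
\end{align}
```

Read this way, the inverse dynamics term recovers the action from two consecutive states, the forward dynamics term predicts the next state from the current state and action, and the future generation term predicts several upcoming states at once.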
What carries the argument
Simultaneous optimization of four objectives (action policy, inverse dynamics, forward dynamics, future generation) on combined text-visual-spatial features inside a single LLM.
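One common way to optimize several objectives at once is a weighted sum of their per-batch losses. The sketch below shows that pattern in plain Python. It is a minimal sketch under stated assumptions: the objective names, the `joint_loss` helper, and the equal default weights are hypothetical, and the source text does not say how FutureNav actually combines or weights its losses.

```python
from typing import Dict

# Hypothetical objective names and weights. The source text does not state
# how the four losses are weighted, so equal weights are an assumption.
DEFAULT_WEIGHTS: Dict[str, float] = {
    "action_policy": 1.0,
    "inverse_dynamics": 1.0,
    "forward_dynamics": 1.0,
    "future_generation": 1.0,
}


def joint_loss(
    losses: Dict[str, float],
    weights: Dict[str, float] = DEFAULT_WEIGHTS,
) -> float:
    """Return the weighted sum of per-objective losses for one batch."""
    # Fail loudly if an objective that should be optimized has no loss value.
    missing = set(weights) - set(losses)
    if missing:
        raise KeyError(f"missing objective losses: {sorted(missing)}")
    return sum(weights[name] * losses[name] for name in weights)
```

Setting a weight to zero in such a scheme switches that objective off, which is the usual way to build the reduced-objective variants an ablation needs.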
If this is right
- Action prediction improves by explicitly modeling state transitions across sequences.
- State-of-the-art results are reached on multiple VLN benchmarks using only a 4B-scale backbone.
- Prior VLN methods are outperformed while inference speed remains unchanged.
- The architecture supports future development of world-action models for navigation.
Where Pith is reading between the lines
- Learned predictive world models may improve robustness on longer or more complex trajectories.
- The same joint objective structure could transfer to other embodied tasks that require planning over time.
- Public release of the code would allow direct tests of whether the four objectives interact without trade-offs.
- Scaling the unified approach might further lower the backbone size needed for strong navigation performance.
Load-bearing premise
Jointly optimizing the four objectives produces better action prediction than separate training or prior VLN methods without hidden trade-offs in generalization.
What would settle it
An ablation experiment in which removing any one of the four objectives yields equal or better benchmark scores than the full joint model.
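The test described above can be phrased as a small check over leave-one-out variants. The sketch below enumerates the four reduced variants and reports any whose benchmark score matches or exceeds the full model. It is illustrative only: the objective names and both helper functions are hypothetical, it assumes a single higher-is-better metric such as success rate, and it contains no results from the paper.

```python
from typing import Dict, FrozenSet, List, Sequence

# Hypothetical labels for the four objectives named in the paper's abstract.
OBJECTIVES = (
    "action_policy",
    "inverse_dynamics",
    "forward_dynamics",
    "future_generation",
)


def leave_one_out_variants(
    objectives: Sequence[str] = OBJECTIVES,
) -> List[FrozenSet[str]]:
    """Return one variant per objective, each dropping exactly that objective."""
    full = frozenset(objectives)
    return [full - {name} for name in objectives]


def falsifying_variants(
    scores: Dict[FrozenSet[str], float],
    objectives: Sequence[str] = OBJECTIVES,
) -> List[FrozenSet[str]]:
    """Return reduced variants scoring at least as well as the full model.

    Assumes a higher-is-better metric. A non-empty result would count
    against the claim that all four objectives contribute.
    """
    full = frozenset(objectives)
    return [
        variant
        for variant in leave_one_out_variants(objectives)
        if scores[variant] >= scores[full]
    ]
```

In practice the `scores` mapping would be filled from evaluation runs on the same benchmark split with matched training budgets, so that any difference can be attributed to the removed objective.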
Original abstract
Vision-and-language navigation (VLN) in continuous environments requires an agent to ground instructions in egocentric observations while maintaining spatial understanding across long action sequences. Recent navigation foundation models have shown strong progress by scaling vision-language models, but they often learn navigation primarily as direct action generation, without explicitly modeling world states or predicting their future evolution. We introduce FutureNav, a VLM-based unified world-action modeling framework for vision-and-language navigation. Specifically, FutureNav jointly encodes text, visual, and spatial features and feeds them into the LLM, and optimizes four objectives for simultaneous world and action modeling: an action policy objective for navigation action prediction, inverse and forward dynamics objectives for modeling state transitions, and a future generation objective for predicting future spatial states. This unified architecture strengthens action prediction while explicitly modeling the world, without sacrificing inference speed. Extensive experiments show that, with only a 4B-scale backbone, FutureNav achieves state-of-the-art performance on multiple VLN benchmarks and substantially outperforms prior VLN methods, paving the way toward future world-action models for VLN. We will release the code and models to support future research.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces FutureNav, a VLM-based framework for vision-and-language navigation in continuous environments. It jointly encodes text, visual, and spatial features into an LLM and optimizes four objectives simultaneously: action policy for navigation, inverse dynamics, forward dynamics, and future generation for spatial state prediction. The central claim is that this unified world-action modeling yields state-of-the-art performance on multiple VLN benchmarks using only a 4B-scale backbone, outperforming prior VLN methods while preserving inference speed.
Significance. If the empirical results hold with proper controls, the work would demonstrate that explicit joint optimization of dynamics and future prediction objectives can strengthen action prediction in VLN without added inference cost, potentially shifting the field toward integrated world-action foundation models. The release of code and models would further support reproducibility.
Major comments (2)
- [Abstract] The claim that FutureNav 'achieves state-of-the-art performance on multiple VLN benchmarks and substantially outperforms prior VLN methods' with a 4B backbone is not accompanied by quantitative numbers, baseline tables, ablation results, or error analysis. This absence makes it impossible to verify whether the gains are attributable to the four-objective joint training rather than to backbone scale, data, or architecture alone.
- [Abstract] The central claim requires that joint optimization of action policy + inverse/forward dynamics + future generation improves action prediction over prior methods or action-policy-only training. No controlled ablation (full model vs. reduced-objective variants) is referenced, leaving the causal contribution of the world-modeling objectives untested and the performance attribution open to alternative explanations.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on the abstract. We address the two major comments below and will revise the abstract accordingly to improve clarity and support for the claims.
Point-by-point responses
- Referee: [Abstract] The claim that FutureNav 'achieves state-of-the-art performance on multiple VLN benchmarks and substantially outperforms prior VLN methods' with a 4B backbone supplies no quantitative numbers, baseline tables, ablation results, or error analysis. This absence makes it impossible to verify whether gains are attributable to the four-objective joint training rather than backbone scale, data, or architecture alone.
  Authors: We agree that the abstract would benefit from including key quantitative results to allow readers to assess the performance claims directly. In the revised version, we will add specific metrics (e.g., success rate improvements on the primary VLN benchmarks) and a brief comparison to prior methods while remaining within abstract length constraints. Revision: yes.
- Referee: [Abstract] The central claim requires that joint optimization of action policy + inverse/forward dynamics + future generation improves action prediction over prior methods or action-policy-only training. No controlled ablation (full model vs. reduced-objective variants) is referenced, leaving the causal contribution of the world-modeling objectives untested and the performance attribution open to alternative explanations.
  Authors: The manuscript body contains controlled ablation studies comparing the full four-objective model against reduced-objective variants to isolate the contribution of the dynamics and future-generation terms. These results are not referenced in the current abstract. We will revise the abstract to explicitly note that ablation experiments support the benefit of joint world-action modeling. Revision: yes.
Circularity Check
No significant circularity; empirical claims rest on external benchmarks
Full rationale
The paper is an empirical ML contribution describing a VLM-based agent trained with four joint objectives (action policy, inverse/forward dynamics, future generation). The provided text contains no mathematical derivation, equations, or parameter-fitting steps that would reduce any claimed prediction to its own inputs by construction. There are no load-bearing self-citations, uniqueness theorems, or smuggled ansatz choices. Performance claims rest on benchmark results rather than internal reductions, so the work qualifies as a self-contained empirical result evaluated against external data.