FutureNav: Unified World-Action Modeling for Vision-and-Language Navigation

Hangjun Ye; Haoxiang Fu; Junwei Liang; Lingfeng Zhang; Mingliang Zhou; Qiang Zhang; Wenbo Ding; Xiaojun Liang; Xiaoshuai Hao; Zeying Gong

arxiv: 2606.30367 · v1 · pith:JSAV5T4Xnew · submitted 2026-06-29 · 💻 cs.RO


Pith reviewed 2026-06-30 05:03 UTC · model grok-4.3


keywords: vision-and-language navigation · VLN · world modeling · action prediction · dynamics modeling · future generation · unified framework · VLM


The pith

FutureNav jointly optimizes an action policy with dynamics and future-generation objectives, and reports state-of-the-art VLN results with a 4B-scale backbone.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents FutureNav as a framework that encodes text, visual, and spatial features together and feeds them into an LLM for vision-and-language navigation. It trains four objectives at once: an action policy for choosing moves, inverse and forward dynamics to capture how states change, and a future generation objective to predict upcoming spatial states. The claim is that this unified training strengthens action decisions by building an explicit world model without slowing down inference. A reader would care if explicit future modeling turns out to be a more effective way to ground instructions across long sequences than methods that generate actions directly.

Core claim

FutureNav is a VLM-based unified world-action modeling framework that jointly encodes text, visual, and spatial features into an LLM and optimizes four objectives simultaneously—an action policy objective for navigation action prediction, inverse and forward dynamics objectives for modeling state transitions, and a future generation objective for predicting future spatial states—to strengthen action prediction while explicitly modeling the world.
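To make the four objectives concrete, here is a minimal sketch of how supervision pairs for each one could be cut from a single recorded trajectory. It uses the textbook definitions of inverse dynamics (state and next state to action) and forward dynamics (state and action to next state); the abstract does not say what each FutureNav objective conditions on, so the function name `build_objective_pairs`, the `horizon` parameter, and the string-valued states are all hypothetical, and the instruction and visual inputs are left out for brevity.

```python
from typing import Dict, List, Tuple


def build_objective_pairs(
    states: List[str], actions: List[str], horizon: int = 2
) -> Dict[str, List[Tuple]]:
    """Cut (input, target) pairs for four objectives from one trajectory.

    states has one more element than actions: actions[t] moves the agent
    from states[t] to states[t + 1]. All names here are illustrative
    assumptions, not the paper's interface.
    """
    if len(states) != len(actions) + 1:
        raise ValueError("expected len(states) == len(actions) + 1")

    pairs: Dict[str, List[Tuple]] = {
        "action_policy": [],
        "inverse_dynamics": [],
        "forward_dynamics": [],
        "future_generation": [],
    }
    for t, action in enumerate(actions):
        state, next_state = states[t], states[t + 1]
        # Action policy: current state -> action to take.
        pairs["action_policy"].append((state, action))
        # Inverse dynamics: (state, next state) -> action that connects them.
        pairs["inverse_dynamics"].append(((state, next_state), action))
        # Forward dynamics: (state, action) -> resulting next state.
        pairs["forward_dynamics"].append(((state, action), next_state))
        # Future generation: current state -> state `horizon` steps ahead,
        # only where the trajectory is long enough to provide a target.
        if t + horizon < len(states):
            pairs["future_generation"].append((state, states[t + horizon]))
    return pairs


pairs = build_objective_pairs(
    states=["s0", "s1", "s2", "s3"],
    actions=["forward", "left", "forward"],
)
```

In this toy setup every trajectory step supplies a target for three of the objectives, while the future-generation objective loses the last `horizon - 1` steps because no far-enough state exists.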

What carries the argument

Simultaneous optimization of four objectives (action policy, inverse dynamics, forward dynamics, future generation) on combined text-visual-spatial features inside a single LLM.
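One common way to optimize several objectives at once is a weighted sum of the per-objective losses, sketched below. The abstract does not state how FutureNav combines its losses, so the weighted-sum form, the default weights of 1.0, and the names `OBJECTIVES` and `joint_loss` are assumptions made purely for illustration.

```python
from typing import Dict, Optional

# Objective names follow the abstract; the identifiers are hypothetical.
OBJECTIVES = (
    "action_policy",
    "inverse_dynamics",
    "forward_dynamics",
    "future_generation",
)


def joint_loss(
    losses: Dict[str, float], weights: Optional[Dict[str, float]] = None
) -> float:
    """Combine the four per-objective losses into one scalar.

    Assumes a plain weighted sum; equal weights are used when none are given.
    """
    if weights is None:
        weights = {name: 1.0 for name in OBJECTIVES}
    missing = [name for name in OBJECTIVES if name not in losses]
    if missing:
        raise KeyError(f"missing losses for: {missing}")
    return sum(weights[name] * losses[name] for name in OBJECTIVES)


step_losses = {
    "action_policy": 1.0,
    "inverse_dynamics": 0.5,
    "forward_dynamics": 0.25,
    "future_generation": 0.25,
}
full = joint_loss(step_losses)

# Zeroing the world-modeling weights recovers an action-policy-only baseline.
policy_only = joint_loss(
    step_losses,
    {
        "action_policy": 1.0,
        "inverse_dynamics": 0.0,
        "forward_dynamics": 0.0,
        "future_generation": 0.0,
    },
)
```

Under this reading, the three world-modeling terms shape training only, which would be consistent with the claim that inference speed is unaffected when the action output is the only thing decoded at test time.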

If this is right

Action prediction improves by explicitly modeling state transitions across sequences.
State-of-the-art results are reached on multiple VLN benchmarks using only a 4B-scale backbone.
Prior VLN methods are outperformed while inference speed remains unchanged.
The architecture supports future development of world-action models for navigation.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Learned predictive world models may improve robustness on longer or more complex trajectories.
The same joint objective structure could transfer to other embodied tasks that require planning over time.
Public release of the code would allow direct tests of whether the four objectives interact without trade-offs.
Scaling the unified approach might further lower the backbone size needed for strong navigation performance.

Load-bearing premise

Jointly optimizing the four objectives produces better action prediction than separate training or prior VLN methods without hidden trade-offs in generalization.

What would settle it

An ablation experiment in which removing any one of the four objectives yields equal or better benchmark scores than the full joint model.
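The test described above can be written as a small check over benchmark scores. This is a sketch only: it assumes a higher-is-better metric such as success rate, the function name `refuting_ablations` is hypothetical, and the numbers are invented placeholders, not results from the paper.

```python
from typing import Dict, List


def refuting_ablations(
    full_score: float, ablated_scores: Dict[str, float]
) -> List[str]:
    """Return the objectives whose removal does not hurt the score.

    ablated_scores maps the name of the removed objective to the score of
    the model trained without it. Any name returned here would count
    against the joint-training premise.
    """
    return sorted(
        name for name, score in ablated_scores.items() if score >= full_score
    )


# Placeholder scores for illustration only; NOT taken from the paper.
flagged = refuting_ablations(
    full_score=0.60,
    ablated_scores={
        "inverse_dynamics": 0.58,
        "forward_dynamics": 0.60,
        "future_generation": 0.55,
    },
)
```

With these placeholder values only the forward-dynamics ablation ties the full model, so it alone is flagged. A real check would also need repeated runs to separate ties from seed noise.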

Original abstract

Vision-and-language navigation (VLN) in continuous environments requires an agent to ground instructions in egocentric observations while maintaining spatial understanding across long action sequences. Recent navigation foundation models have shown strong progress by scaling vision-language models, but they often learn navigation primarily as direct action generation, without explicitly modeling world states or predicting their future evolution. We introduce FutureNav, a VLM-based unified world-action modeling framework for vision-and-language navigation. Specifically, FutureNav jointly encodes text, visual, and spatial features and feeds them into the LLM, and optimizes four objectives for simultaneous world and action modeling: an action policy objective for navigation action prediction, inverse and forward dynamics objectives for modeling state transitions, and a future generation objective for predicting future spatial states. This unified architecture strengthens action prediction while explicitly modeling the world, without sacrificing inference speed. Extensive experiments show that, with only a 4B-scale backbone, FutureNav achieves state-of-the-art performance on multiple VLN benchmarks and substantially outperforms prior VLN methods, paving the way toward future world-action models for VLN. We will release the code and models to support future research.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

FutureNav trains a VLM for VLN on four joint objectives, but the abstract supplies no numbers or ablations, leaving the SOTA claim unsupported.


The core move here is training a 4B VLM on action policy plus inverse dynamics, forward dynamics, and future generation all at once for continuous VLN. The setup feeds combined text-visual-spatial features into the LLM and runs the four losses together, with the stated goal of improving action prediction through explicit world modeling without extra inference cost.

That combination is a straightforward extension of recent navigation foundation models rather than a big conceptual leap. The paper earns credit for keeping the architecture simple enough to run at the same speed as a standard VLM while adding the dynamics terms.

The problem is that the abstract asserts SOTA results and substantial outperformance on multiple benchmarks but shows zero quantitative numbers, zero baseline tables, and zero ablation results. The stress-test concern lands cleanly: without a controlled comparison of the full four-objective model against an action-policy-only version on the same data and benchmarks, it is impossible to know whether the world-modeling losses are responsible for any gains or whether scale, data, or architecture explain the outcome.

This work is aimed at the VLN foundation-model crowd. A reader already following scaling efforts in navigation would get value from seeing whether the multi-objective recipe actually moves the needle once the experiments are shown.

The paper deserves peer review so the authors can supply the missing ablations and numbers. The idea is testable and worth checking properly.

Referee Report

2 major / 0 minor

Summary. The paper introduces FutureNav, a VLM-based framework for vision-and-language navigation in continuous environments. It jointly encodes text, visual, and spatial features into an LLM and optimizes four objectives simultaneously: action policy for navigation, inverse dynamics, forward dynamics, and future generation for spatial state prediction. The central claim is that this unified world-action modeling yields state-of-the-art performance on multiple VLN benchmarks using only a 4B-scale backbone, outperforming prior VLN methods while preserving inference speed.

Significance. If the empirical results hold with proper controls, the work would demonstrate that explicit joint optimization of dynamics and future prediction objectives can strengthen action prediction in VLN without added inference cost, potentially shifting the field toward integrated world-action foundation models. The release of code and models would further support reproducibility.

major comments (2)

[Abstract] The claim that FutureNav 'achieves state-of-the-art performance on multiple VLN benchmarks and substantially outperforms prior VLN methods' with a 4B backbone is made without quantitative numbers, baseline tables, ablation results, or error analysis. This absence makes it impossible to verify whether gains are attributable to the four-objective joint training rather than backbone scale, data, or architecture alone.
[Abstract] The central claim requires that joint optimization of action policy + inverse/forward dynamics + future generation improves action prediction over prior methods or action-policy-only training. No controlled ablation (full model vs. reduced-objective variants) is referenced, leaving the causal contribution of the world-modeling objectives untested and the performance attribution open to alternative explanations.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on the abstract. We address the two major comments below and will revise the abstract accordingly to improve clarity and support for the claims.


Referee: [Abstract] The claim that FutureNav 'achieves state-of-the-art performance on multiple VLN benchmarks and substantially outperforms prior VLN methods' with a 4B backbone is made without quantitative numbers, baseline tables, ablation results, or error analysis. This absence makes it impossible to verify whether gains are attributable to the four-objective joint training rather than backbone scale, data, or architecture alone.

Authors: We agree that the abstract would benefit from including key quantitative results to allow readers to assess the performance claims directly. In the revised version, we will add specific metrics (e.g., success rate improvements on the primary VLN benchmarks) and a brief comparison to prior methods while remaining within abstract length constraints. revision: yes
Referee: [Abstract] The central claim requires that joint optimization of action policy + inverse/forward dynamics + future generation improves action prediction over prior methods or action-policy-only training. No controlled ablation (full model vs. reduced-objective variants) is referenced, leaving the causal contribution of the world-modeling objectives untested and the performance attribution open to alternative explanations.

Authors: The manuscript body contains controlled ablation studies comparing the full four-objective model against reduced-objective variants to isolate the contribution of the dynamics and future-generation terms. These results are not referenced in the current abstract. We will revise the abstract to explicitly note that ablation experiments support the benefit of joint world-action modeling. revision: yes

Circularity Check

0 steps flagged

No significant circularity; empirical claims rest on external benchmarks


The paper is an empirical ML contribution describing a VLM-based agent trained with four joint objectives (action policy, inverse/forward dynamics, future generation). The provided text contains no mathematical derivation, equations, or parameter-fitting steps that would reduce any claimed prediction to its own inputs by construction. No load-bearing self-citations, uniqueness theorems, or smuggled ansatz choices appear. Performance claims are asserted via benchmark results rather than internal reductions, which satisfies the condition for a self-contained empirical result against external data.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Review performed on abstract only; no explicit free parameters, axioms, or invented entities are stated in the provided text.

pith-pipeline@v0.9.1-grok · 5757 in / 1136 out tokens · 25630 ms · 2026-06-30T05:03:34.957682+00:00 · methodology


Reference graph

Works this paper leans on

80 extracted references · 17 linked inside Pith

[1]

Vision-and- language navigation: Interpreting visually-grounded navigation instructions in real environments

Peter Anderson, Qi Wu, Damien Teney, Jake Bruce, Mark Johnson, Niko Sünderhauf, Ian Reid, Stephen Gould, and Anton Van Den Hengel. Vision-and- language navigation: Interpreting visually-grounded navigation instructions in real environments. InPro- ceedings of the IEEE conference on computer vision and pattern recognition, pages 3674–3683, 2018

2018
[2]

Beyond the nav-graph: Vision-and-language navigation in continuous envi- ronments

Jacob Krantz, Erik Wijmans, Arjun Majumdar, Dhruv Batra, and Stefan Lee. Beyond the nav-graph: Vision-and-language navigation in continuous envi- ronments. InEuropean Conference on Computer Vision, pages 104–120. Springer, 2020

2020
[3]

Room-across-room: Multilingual vision-and-language navigation with dense spatiotemporal grounding.arXiv preprint arXiv:2010.07954, 2020

Alexander Ku, Peter Anderson, Roma Patel, Eu- gene Ie, and Jason Baldridge. Room-across-room: Multilingual vision-and-language navigation with dense spatiotemporal grounding.arXiv preprint arXiv:2010.07954, 2020

arXiv 2010
[4]

Waypoint mod- els for instruction-guided navigation in continuous environments

Jacob Krantz, Aaron Gokaslan, Dhruv Batra, Stefan Lee, and Oleksandr Maksymets. Waypoint mod- els for instruction-guided navigation in continuous environments. InProceedings of the IEEE/CVF In- ternational Conference on Computer Vision, pages 15162–15171, 2021

2021
[5]

Bridging the gap between learning in discrete andcontinuousenvironmentsforvision-and-language navigation

Yicong Hong, Zun Wang, Qi Wu, and Stephen Gould. Bridging the gap between learning in discrete andcontinuousenvironmentsforvision-and-language navigation. InProceedings of the IEEE/CVF con- ference on computer vision and pattern recognition, pages 15439–15449, 2022

2022
[6]

Learning navigational visual representations with semantic map supervision

Yicong Hong, Yang Zhou, Ruiyi Zhang, Franck Der- noncourt, Trung Bui, Stephen Gould, and Hao Tan. Learning navigational visual representations with semantic map supervision. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 3055–3067, 2023

2023
[7]

Gridmm: Grid memory map for vision-and-language navigation

Zihan Wang, Xiangyang Li, Jiahao Yang, Yeqi Liu, and Shuqiang Jiang. Gridmm: Grid memory map for vision-and-language navigation. InProceedings of the IEEE/CVF International conference on computer vision, pages 15625–15636, 2023

2023
[8]

Dreamwalker: Mental planning for continuous vision-language navigation

Hanqing Wang, Wei Liang, Luc Van Gool, and Wen- guan Wang. Dreamwalker: Mental planning for continuous vision-language navigation. InProceed- ings of the IEEE/CVF international conference on computer vision, pages 10873–10883, 2023

2023
[9]

Mapnav: A novel memory representation via annotated semantic maps for vlm-based vision-and- language navigation

Lingfeng Zhang, Xiaoshuai Hao, Qinwen Xu, Qiang Zhang, Xinyao Zhang, Pengwei Wang, Jing Zhang, Zhongyuan Wang, Shanghang Zhang, and Renjing Xu. Mapnav: A novel memory representation via annotated semantic maps for vlm-based vision-and- language navigation. InProceedings of the 63rd Annual Meeting of the Association for Computational 10 Linguistics (Volume...

2025
[10]

Toponav: Topological graphs as a key enabler for advanced object navigation

Peng Liu, Qiang Zhang, Di Peng, Lingfeng Zhang, Yiran Qin, Huan Zhou, Jiajun Ma, Renjing Xu, and Yandong Ji. Toponav: Topological graphs as a key enabler for advanced object navigation. InIEEE In- ternational Conference on Robotics and Automation, 2026

2026
[11]

Trihelper: Zero-shot object navigation with dynamic assistance

Lingfeng Zhang, Qiang Zhang, Hao Wang, Erjia Xiao, Zixuan Jiang, Honglei Chen, and Renjing Xu. Trihelper: Zero-shot object navigation with dynamic assistance. InIEEE/RSJ International Conference on Intelligent Robots and Systems, 2024. URLhttps: //arxiv.org/abs/2403.15223

arXiv 2024
[12]

Multi-floor zero-shot object navigation policy

Lingfeng Zhang, Hanqing Wang, Enyu Xiao, Xinyao Zhang, Qiang Zhang, Zihan Jiang, and Renjing Xu. Multi-floor zero-shot object navigation policy. In IEEE International Conference on Robotics and Au- tomation, 2025

2025
[13]

Stairway to success: An online floor- awarezero-shotobject-goalnavigationframeworkvia llm-driven coarse-to-fine exploration.IEEE Robotics and Automation Letters, 2026

Zeying Gong, Rui Li, Tianyu Hu, Ruofei Qiu, Linghe Kong, Lingfeng Zhang, Guoqing Zhao, Yu Ding, and Junwei Liang. Stairway to success: An online floor- awarezero-shotobject-goalnavigationframeworkvia llm-driven coarse-to-fine exploration.IEEE Robotics and Automation Letters, 2026

2026
[14]

Socialnav-map: Dynamic mapping with human tra- jectory prediction for zero-shot social navigation

Lingfeng Zhang, Erjia Xiao, Xiaoshuai Hao, Haox- iang Fu, Zeying Gong, Long Chen, Xiaojun Liang, Renjing Xu, Hangjun Ye, and Wenbo Ding. Socialnav-map: Dynamic mapping with human tra- jectory prediction for zero-shot social navigation. arXiv preprint arXiv:2511.12232, 2025. URLhttps: //arxiv.org/abs/2511.12232

arXiv 2025
[15]

Walk with me: Long- horizon social navigation for human-centric outdoor assistance.arXiv preprint arXiv:2604.26839, 2026

Lingfeng Zhang, Xiaoshuai Hao, Xinyu Bu, Yingbo Tang, Haoran Li, Jian Lu, Xinyu Wei, Jiajun Ma, Yichen Liu, Jing Zhang, et al. Walk with me: Long- horizon social navigation for human-centric outdoor assistance.arXiv preprint arXiv:2604.26839, 2026. URLhttps://arxiv.org/abs/2604.26839

Pith/arXiv arXiv 2026
[16]

The robosense challenge: Sense anything, navigate anywhere, adapt across platforms.arXiv preprint, 2026

Linghe Kong, Sicheng Xie, Zeying Gong, Yuxiang Li, Min Chu, An Liang, Yuhang Dong, Tianyu Hu, Ruofei Qiu, Rui Li, et al. The robosense challenge: Sense anything, navigate anywhere, adapt across platforms.arXiv preprint, 2026

2026
[17]

Team xiaomi ev-ad vla: Caption- guided retrieval system for cross-modal drone nav- igation – technical report for iros 2025 robosense challenge track 4

Lingfeng Zhang, Enyu Xiao, Yifan Zhang, Haoxiang Fu, Rui Hu, Yuchen Ma, Wenbo Ding, Long Chen, Hang Ye, et al. Team xiaomi ev-ad vla: Caption- guided retrieval system for cross-modal drone nav- igation – technical report for iros 2025 robosense challenge track 4. Technical report, IROS 2025 Ro- boSense Challenge, 2025

2025
[18]

Learning to navigate socially through proactive risk perception – technical report for iros 2025 robosense challenge social navigation track

Enyu Xiao, Lingfeng Zhang, Yingbo Tang, Haoran Cheng, Renjing Xu, Wenbo Ding, Li Zhou, Long Chen, Hang Ye, et al. Learning to navigate socially through proactive risk perception – technical report for iros 2025 robosense challenge social navigation track. Technical report, IROS 2025 RoboSense Chal- lenge, 2025

2025
[19]

Mapfusion: A novel bev feature fusion network for multi-modal map construc- tion.Information Fusion, 119:103018, 2025

Xiaoshuai Hao, Yunfeng Diao, Mengchuan Wei, Yi- fan Yang, Peng Hao, Rong Yin, Hui Zhang, Weiming Li, Shu Zhao, and Yu Liu. Mapfusion: A novel bev feature fusion network for multi-modal map construc- tion.Information Fusion, 119:103018, 2025

2025
[20]

Synergistic prompting for comple- mentarity and consistency in incomplete multi-view clustering.IEEE Transactions on Image Processing, 2026

Xiaoshuai Hao, Zhihui Zhang, Yingbo Tang, Lingfeng Zhang, Peng Hao, Yunfeng Diao, Guangyin Jin, and Yu Liu. Synergistic prompting for comple- mentarity and consistency in incomplete multi-view clustering.IEEE Transactions on Image Processing, 2026

2026
[21]

Embodied spatial affordance: spatial-aware affordance learning for embodied nav- igation and manipulation.IEEE Transactions on Image Processing, 2026

Xiaoshuai Hao, Yingbo Tang, Lingfeng Zhang, Long Chen, Wei Zhou, Jungong Han, Wenbo Ding, and Xiao-Ping Zhang. Embodied spatial affordance: spatial-aware affordance learning for embodied nav- igation and manipulation.IEEE Transactions on Image Processing, 2026

2026
[22]

Embodiedplan- 1k: A benchmark for complex navigation- manipulation task planning

Lingfeng Zhang, Yingbo Tang, Xinyu Zheng, Liang Li, Jinglin Xu, and Xiaoshuai Hao. Embodiedplan- 1k: A benchmark for complex navigation- manipulation task planning. 2026

2026
[23]

Navid: Video-based vlm plans the next step for vision-and-language nav- igation.arXiv preprint arXiv:2402.15852, 2024

Jiazhao Zhang, Kunyu Wang, Rongtao Xu, Gengze Zhou, Yicong Hong, Xiaomeng Fang, Qi Wu, Zhizheng Zhang, and He Wang. Navid: Video-based vlm plans the next step for vision-and-language nav- igation.arXiv preprint arXiv:2402.15852, 2024

Pith/arXiv arXiv 2024
[24]

Uni-navid: A video-based vision-language-action model for uni- fying embodied navigation tasks.arXiv preprint arXiv:2412.06224, 2024

Jiazhao Zhang, Kunyu Wang, Shaoan Wang, Ming- han Li, Haoran Liu, Songlin Wei, Zhongyuan Wang, Zhizheng Zhang, and He Wang. Uni-navid: A video-based vision-language-action model for uni- fying embodied navigation tasks.arXiv preprint arXiv:2412.06224, 2024

Pith/arXiv arXiv 2024
[25]

Navila: Legged robot vision-language-action model for navigation.arXiv preprint arXiv:2412.04453, 2024

An-Chieh Cheng, Yandong Ji, Zhaojing Yang, Zaitian Gongye, Xueyan Zou, Jan Kautz, Erdem Bıyık, Hongxu Yin, Sifei Liu, and Xiaolong Wang. Navila: Legged robot vision-language-action model for navigation.arXiv preprint arXiv:2412.04453, 2024

arXiv 2024
[26]

Streamvln: Streaming vision-and-language naviga- tion via slowfast context modeling.arXiv preprint arXiv:2507.05240, 2025

Meng Wei, Chenyang Wan, Xiqian Yu, Tai Wang, Yuqiang Yang, Xiaohan Mao, Chenming Zhu, Wenzhe Cai, Hanqing Wang, Yilun Chen, et al. Streamvln: Streaming vision-and-language naviga- tion via slowfast context modeling.arXiv preprint arXiv:2507.05240, 2025

arXiv 2025
[27]

Embodied Navigation Foundation Model

Jiazhao Zhang, Anqi Li, Yunpeng Qi, Minghan Li, Jiahang Liu, Shaoan Wang, Haoran Liu, Gengze Zhou, Yuze Wu, Xingxing Li, Yuxin Fan, Wenjun Li, Zhibo Chen, Fei Gao, Qi Wu, Zhizheng Zhang, and He Wang. Embodied Navigation Foundation Model. InInternational Conference on Learning Representations, 2026. URL https://openreview. net/forum?id=kkBOIsrCXh. 11

2026
[29]

URLhttps://arxiv.org/abs/2508.04598

arXiv
[30]

Robobrain 2.0 technical report.arXiv preprint arXiv:2507.02029, 2025

BAAI RoboBrain Team, Mingyu Cao, Huajie Tan, Yuheng Ji, Xiansheng Chen, Minglan Lin, Zhiyu Li, Zhou Cao, Pengwei Wang, Enshen Zhou, et al. Robobrain 2.0 technical report.arXiv preprint arXiv:2507.02029, 2025. URL https://arxiv.org/ abs/2507.02029

arXiv 2025
[31]

Mimo-embodied: X- embodied foundation model technical report.arXiv preprint arXiv:2511.16518, 2025

Xiaoshuai Hao, Lei Zhou, Zhijian Huang, Zhiwen Hou, Yingbo Tang, Lingfeng Zhang, Guang Li, Zheng Lu, Shuhuai Ren, et al. Mimo-embodied: X- embodied foundation model technical report.arXiv preprint arXiv:2511.16518, 2025. URL https:// arxiv.org/abs/2511.16518

Pith/arXiv arXiv 2025
[32]

Onevla: A uni- fied framework for embodied tasks.arXiv preprint arXiv:2606.01241, 2026

Lingfeng Zhang, Xiaoshuai Hao, Yingbo Tang, Lei Zhou, Shuyi Zhang, Jinkun Liu, Hongsheng Li, Chen- hao Zhang, Qiang Zhang, Hangjun Ye, Xiaojun Liang, Long Chen, and Wenbo Ding. Onevla: A uni- fied framework for embodied tasks.arXiv preprint arXiv:2606.01241, 2026. URL https://arxiv.org/ abs/2606.01241

Pith/arXiv arXiv 2026
[33]

Onevl: One-step latent reasoning and planning with vision-language explanation.arXiv preprint arXiv:2604.18486, 2026

Jian Lu, Jian Guan, Zhaorui Huang, Jiacheng Li, Guoqiang Li, Linghe Kong, Yuxiang Li, Hao- ran Wang, Shiyu Xu, Yifan Luo, Fei Li, et al. Onevl: One-step latent reasoning and planning with vision-language explanation.arXiv preprint arXiv:2604.18486, 2026. URL https://arxiv.org/ abs/2604.18486

Pith/arXiv arXiv 2026
[34]

Video-cot: A com- prehensive dataset for spatiotemporal understanding of videos based on chain-of-thought

Shilong Zhang, Xiaoshuai Hao, Yingbo Tang, Lingfeng Zhang, Pengwei Wang, Zhongyuan Wang, Huan Ma, and Shanghang Zhang. Video-cot: A com- prehensive dataset for spatiotemporal understanding of videos based on chain-of-thought. InProceedings of the ACM International Conference on Multimedia, 2025

2025
[35]

Is your vlm sky-ready? a comprehensive spatial intelligence benchmark for uav navigation

Lingfeng Zhang, Yifan Zhang, Haoran Li, Haoxiang Fu, Yingbo Tang, Hang Ye, Long Chen, Xiaojun Liang, Xiaoshuai Hao, et al. Is your vlm sky-ready? a comprehensive spatial intelligence benchmark for uav navigation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recog- nition, 2026

2026
[37]

URLhttps://arxiv.org/abs/2503.09010

arXiv
[38]

Roboafford: A dataset and benchmark for enhancing object and spatial affordance learning in robot manipulation

Yingbo Tang, Lingfeng Zhang, Shilong Zhang, Yifan Zhao, and Xiaoshuai Hao. Roboafford: A dataset and benchmark for enhancing object and spatial affordance learning in robot manipulation. InPro- ceedings of the ACM International Conference on Multimedia, 2025

2025
[39]

Roboafford++: A gener- ative ai-enhanced dataset for multimodal affordance learning in robotic manipulation and navigation

Xiaoshuai Hao, Yingbo Tang, Lingfeng Zhang, Yuchen Ma, Yuhang Diao, Zihan Jia, Wenbo Ding, Hang Ye, and Long Chen. Roboafford++: A gener- ative ai-enhanced dataset for multimodal affordance learning in robotic manipulation and navigation. In IEEE/RSJ International Conference on Intelligent Robots and Systems Workshop on RoDGE, 2025

2025
[40]

Evaluating gpt-4o’s embodied intelligence: A comprehensive empirical study

Yifan Wu, Haoran Lyu, Yingbo Tang, Lingfeng Zhang, Ziheng Zhang, Wenxuan Zhou, and Shibo Hao. Evaluating gpt-4o’s embodied intelligence: A comprehensive empirical study. Technical report, Technical Report, 2025

2025
[41]

Exploring typographic visual prompts injection threats in cross-modality genera- tion models

Haoran Cheng, Enyu Xiao, Yujie Wang, Lingfeng Zhang, Kaidi Xu, Meng Sun, Xiaoshuai Hao, Jinjin Gu, and Renjing Xu. Exploring typographic visual prompts injection threats in cross-modality genera- tion models. InInternational Joint Conference on Artificial Intelligence Workshop on Deepfake Detec- tion, Localization and Interpretability, 2025

2025
[42]

Vquala 2025 challenge on engagement prediction for short videos: Methods and results

Dong Li, Shuang Ma, Hang Hua, Wei Li, Jian Wang, Chengwei Zhou, Feng Guan, Xin Li, Zhi Yu, Yao Lu, et al. Vquala 2025 challenge on engagement prediction for short videos: Methods and results. In IEEE/CVF International Conference on Computer Vision Workshop, 2025

2025
[43]

H2r-bm: Can leveraging human videos enhance per- formance and generalizability in robotic bimanual manipulation?Pattern Recognition, page 113637, 2026

Xiaoshuai Hao, Haoran Lyu, Lingfeng Zhang, Ruidong Liu, Di Wu, Jing Zhang, and Long Chen. H2r-bm: Can leveraging human videos enhance per- formance and generalizability in robotic bimanual manipulation?Pattern Recognition, page 113637, 2026

2026
[44]

What you see is what you reach: Towards spatial navigation with high-level human instructions

Lingfeng Zhang, Haoxiang Fu, Xiaoshuai Hao, Shi- long Zhang, Qiang Zhang, Ruidong Liu, Long Chen, and Wenbo Ding. What you see is what you reach: Towards spatial navigation with high-level human instructions. InProceedings of the AAAI Conference on Artificial Intelligence, 2026

2026
[45]

Mesh- mimic: Geometry-aware humanoid motion learning through 3d scene reconstruction.arXiv preprint arXiv:2602.15733, 2026

Qiang Zhang, Jiajun Ma, Peng Liu, Shiyu Shi, Zhiqiang Su, Zhongyuan Wang, Jiaming Sun, Wenx- uan Cui, Jia Yu, Guang Han, et al. Mesh- mimic: Geometry-aware humanoid motion learning through 3d scene reconstruction.arXiv preprint arXiv:2602.15733, 2026. URL https://arxiv.org/ abs/2602.15733

arXiv 2026
[46]

JanusVLN: Decoupling Se- mantics and Spatiality with Dual Implicit Memory for Vision-Language Navigation

Shuang Zeng, Dekang Qi, Xinyuan Chang, Feng Xiong, Shichao Xie, Xiaolong Wu, Shiyi Liang, Mu Xu, and Xing Wei. JanusVLN: Decoupling Se- mantics and Spatiality with Dual Implicit Memory for Vision-Language Navigation. InInternational Conference on Learning Representations, 2026. URL https://openreview.net/forum?id=RnuB0Nlbd5. 12

2026
[47]

Navigation world models

AmirBar, GaoyueZhou, DannyTran, TrevorDarrell, and Yann LeCun. Navigation world models. InPro- ceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 15791–15801,
[48]

URLhttps://arxiv.org/abs/2412.03572

arXiv
[49]

Worldvln: Autoregressive world action model for aerial vision-language navigation.arXiv preprint arXiv:2605.15964, 2026

Baining Zhao, Jiacheng Xu, Weicheng Feng, Xin Zhang, Zhaolu Wang, Haoyang Wang, Shilong Ji, Ziyou Wang, Jianjie Fang, Zhiheng Zheng, Weichen Zhang, Yu Shang, Wei Wu, Chen Gao, Xinlei Chen, and Yong Li. Worldvln: Autoregressive world action model for aerial vision-language navigation.arXiv preprint arXiv:2605.15964, 2026. URL https:// arxiv.org/abs/2605.15964

Pith/arXiv arXiv 2026
[50]

Wam-nav: Asymmetric latent world-action model- ing for unified visual navigation.arXiv preprint arXiv:2606.04907, 2026

Ning Yang, Yan Huang, Kaiwen Peng, Ziheng He, Kai Wang, Cui Miao, Kailin Lyu, Guo Li, Xiaofeng Wang, Zheng Zhu, Jing Liu, and Nianfeng Liu. Wam-nav: Asymmetric latent world-action model- ing for unified visual navigation.arXiv preprint arXiv:2606.04907, 2026. URL https://arxiv.org/ abs/2606.04907

Pith/arXiv arXiv 2026
[51]

Navforesee: A unified vision-language world model for hierarchi- cal planning and dual-horizon navigation prediction

Fei Liu, Shichao Xie, Minghua Luo, Zedong Chu, Junjun Hu, Xiaolong Wu, and Mu Xu. Navforesee: A unified vision-language world model for hierarchi- cal planning and dual-horizon navigation prediction. arXiv preprint arXiv:2512.01550, 2025

arXiv 2025
[52]

Astranav-world: World model for foresight control and consistency.arXiv preprint arXiv:2512.21714, 2025

Junjun Hu, Jintao Chen, Haochen Bai, Minghua Luo, Shichao Xie, Ziyi Chen, Fei Liu, Zedong Chu, Xinda Xue, Botao Ren, et al. Astranav-world: World model for foresight control and consistency.arXiv preprint arXiv:2512.21714, 2025

Pith/arXiv arXiv 2025
[53]

Nav- morph: A self-evolving world model for vision-and- language navigation in continuous environments

Xuan Yao, Junyu Gao, and Changsheng Xu. Nav- morph: A self-evolving world model for vision-and- language navigation in continuous environments. arXiv preprint arXiv:2506.23468, 2025

arXiv 2025
[54]

Oa-wam: Object- addressable world action model for robust robot ma- nipulation.arXiv preprint arXiv:2605.06481, 2026

Yichen Liu, Peng Sun, Shuo Li, Yuxuan Xie, Lingfeng Zhang, Xingyu Chao, Siyuan Dong, Fei Chen, Xiaoping Zhang, et al. Oa-wam: Object- addressable world action model for robust robot ma- nipulation.arXiv preprint arXiv:2605.06481, 2026. URLhttps://arxiv.org/abs/2605.06481

Pith/arXiv arXiv 2026
[55]

Thinking in text and im- ages: Interleaved vision-language reasoning traces for long-horizon robot manipulation.arXiv preprint arXiv:2605.00438, 2026

Jiahao Liu, Haoran Chi, Lingfeng Zhang, Yuxuan Xie, Yanan Wang, Long Chen, Hang Ye, Xiaoshuai Hao, and Wenbo Ding. Thinking in text and im- ages: Interleaved vision-language reasoning traces for long-horizon robot manipulation.arXiv preprint arXiv:2605.00438, 2026. URL https://arxiv.org/ abs/2605.00438

Pith/arXiv arXiv 2026
[56] Yuchen Ma, Fei Luo, Lingfeng Zhang, Chen Zhao, Ming Wang, Yifan Wu, Ziyu Qian, Yao Lu, Long Chen, et al. Reasoning emerges from constrained inference manifolds in large language models. arXiv preprint arXiv:2605.08142, 2026. URL https://arxiv.org/abs/2605.08142

[57] Haoxiang Fu, Lingfeng Zhang, Haoran Li, Rui Hu, Zihan Li, Guoqing Liu, Zeyu Tan, Long Chen, Hang Ye, and Xiaoshuai Hao. Sef-map: Subspace-decomposed expert fusion for robust multimodal hd map prediction. In IEEE International Conference on Robotics and Automation, 2026.

[59] URL https://arxiv.org/abs/2604.05405

[60] Jing Yu Koh, Honglak Lee, Yinfei Yang, Jason Baldridge, and Peter Anderson. Pathdreamer: A world model for indoor navigation. arXiv preprint arXiv:2105.08756, 2021. URL https://arxiv.org/abs/2105.08756

[61] Yanjia Huang, Xianshun Jiang, Xiangbo Gao, Mingyang Wu, and Zhengzhong Tu. Vistav2: World imagination for indoor vision-and-language navigation. arXiv preprint arXiv:2512.00041, 2025. URL https://arxiv.org/abs/2512.00041

[62] Bingqian Lin, Yunshuang Nie, Ziming Wei, Jiaqi Chen, Shikui Ma, Jianhua Han, Hang Xu, Xiaojun Chang, and Xiaodan Liang. Navcot: Boosting llm-based vision-and-language navigation via learning disentangled reasoning. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2025. URL https://arxiv.org/abs/2403.07376

[63] Shuo Wang, Yongcai Wang, Wanting Li, Yucheng Wang, Maiyue Chen, Kaihui Wang, Zhizhong Su, Xudong Cai, Yeying Jin, Deying Li, and Zhaoxin Fan. Monodream: Monocular vision-language navigation with panoramic dreaming. arXiv preprint arXiv:2508.02549, 2025. URL https://arxiv.org/abs/2508.02549

[64] Tianyuan Yuan, Zibin Dong, Yicheng Liu, and Hang Zhao. Fast-wam: Do world action models need test-time future imagination? arXiv preprint arXiv:2603.16666, 2026.

[65] Pengna Li, Kangyi Wu, Shaoqing Xu, Fang Li, Hanbing Li, Lin Zhao, Kailin Lyu, Long Chen, Zhi-Xin Yang, and Nanning Zheng. Spaact: Spatially-activated transition learning with curriculum adaptation for vision-language navigation. arXiv preprint arXiv:2604.27620, 2026.

[66] Jianyuan Wang, Minghao Chen, Nikita Karaev, Andrea Vedaldi, Christian Rupprecht, and David Novotny. Vggt: Visual geometry grounded transformer. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 5294–5306, 2025.

[67] Jacob Krantz and Stefan Lee. Sim-2-sim transfer for vision-and-language navigation in continuous environments. In European conference on computer vision, pages 588–603. Springer, 2022.

[68] Dong An, Zun Wang, Yangguang Li, Yi Wang, Yicong Hong, Yan Huang, Liang Wang, and Jing Shao. 1st place solutions for rxr-habitat vision-and-language navigation competition (cvpr 2022). arXiv preprint arXiv:2206.11610, 2022.

[69] Yuxing Long, Wenzhe Cai, Hongcheng Wang, Guanqi Zhan, and Hao Dong. Instructnav: Zero-shot system for generic instruction navigation in unexplored environment. arXiv preprint arXiv:2406.04882, 2024.

[70] Siqi Zhang, Yanyuan Qiao, Qunbo Wang, Zike Yan, Qi Wu, Zhihua Wei, and Jing Liu. Cosmo: Combination of selective memorization for low-cost vision-and-language navigation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 5511–5522, 2025.

[71] Jiaqi Chen, Bingqian Lin, Xinmin Liu, Lin Ma, Xiaodan Liang, and Kwan-Yee K Wong. Affordances-oriented planning using foundation models for continuous vision-language navigation. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 39, pages 23568–23576, 2025.

[72] Sonia Raychaudhuri, Saim Wani, Shivansh Patel, Unnat Jain, and Angel Chang. Language-aligned waypoint (law) supervision for vision-and-language navigation in continuous environments. In Proceedings of the 2021 conference on empirical methods in natural language processing, pages 4018–4028, 2021.

[73] Zihan Wang and Gim Hee Lee. g3d-lf: Generalizable 3d-language feature fields for embodied tasks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14191–14202, 2025.

[74] Haoran Liu, Weikang Wan, Xiqian Yu, Minghan Li, Jiazhao Zhang, Bo Zhao, Zhibo Chen, Zhongyuan Wang, Zhizheng Zhang, and He Wang. Navid-4d: Unleashing spatial intelligence in egocentric rgb-d videos for vision-and-language navigation. In 2025 IEEE International Conference on Robotics and Automation (ICRA), pages 10607–10615. IEEE, 2025.

[75] Zihan Wang, Xiangyang Li, Jiahao Yang, Yeqi Liu, and Shuqiang Jiang. Sim-to-real transfer via 3d feature fields for vision-and-language navigation. arXiv preprint arXiv:2406.09798, 2024.

[76] Meng Wei, Chenyang Wan, Jiaqi Peng, Xiqian Yu, Yuqiang Yang, Delin Feng, Wenzhe Cai, Chenming Zhu, Tai Wang, Jiangmiao Pang, et al. Ground slow, move fast: A dual-system foundation model for generalizable vision-and-language navigation. arXiv preprint arXiv:2512.08186, 2025.

[77] Manolis Savva, Abhishek Kadian, Oleksandr Maksymets, Yili Zhao, Erik Wijmans, Bhavana Jain, Julian Straub, Jia Liu, Vladlen Koltun, Jitendra Malik, et al. Habitat: A platform for embodied ai research. In Proceedings of the IEEE/CVF international conference on computer vision, pages 9339–9347, 2019.

[78] Angel Chang, Angela Dai, Thomas Funkhouser, Maciej Halber, Matthias Niessner, Manolis Savva, Shuran Song, Andy Zeng, and Yinda Zhang. Matterport3d: Learning from rgb-d data in indoor environments. arXiv preprint arXiv:1709.06158, 2017.

[79] Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, Wenbin Ge, Zhifang Guo, Qidong Huang, Jie Huang, Fei Huang, Binyuan Hui, Shutong Jiang, Zhaohai Li, Mingsheng Li, Mei Li, Kaixin Li, Zicheng Lin, Junyang Lin, Xuejing Liu, Jiawei Liu, Chenglong Liu, Yang Liu, Dayiheng Liu, Sh... Qwen3-vl technical report, 2025.

[80] Zun Wang, Jialu Li, Yicong Hong, Yi Wang, Qi Wu, Mohit Bansal, Stephen Gould, Hao Tan, and Yu Qiao. Scaling data generation in vision-and-language navigation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 12009–12020, 2023.

[81] Stéphane Ross, Geoffrey Gordon, and Drew Bagnell. A reduction of imitation learning and structured prediction to no-regret online learning. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pages 627–635. JMLR Workshop and Conference Proceedings, 2011.

[82] Diederik P. Kingma and Max Welling. Auto-encoding variational bayes. In International Conference on Learning Representations, 2014.

[83] Mathilde Caron, Hugo Touvron, Ishan Misra, Hervé Jégou, Julien Mairal, Piotr Bojanowski, and Armand Joulin. Emerging properties in self-supervised vision transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 9650–9660, 2021.

[1] [1]

Vision-and- language navigation: Interpreting visually-grounded navigation instructions in real environments

Peter Anderson, Qi Wu, Damien Teney, Jake Bruce, Mark Johnson, Niko Sünderhauf, Ian Reid, Stephen Gould, and Anton Van Den Hengel. Vision-and- language navigation: Interpreting visually-grounded navigation instructions in real environments. InPro- ceedings of the IEEE conference on computer vision and pattern recognition, pages 3674–3683, 2018

2018

[2] [2]

Beyond the nav-graph: Vision-and-language navigation in continuous envi- ronments

Jacob Krantz, Erik Wijmans, Arjun Majumdar, Dhruv Batra, and Stefan Lee. Beyond the nav-graph: Vision-and-language navigation in continuous envi- ronments. InEuropean Conference on Computer Vision, pages 104–120. Springer, 2020

2020

[3] [3]

Room-across-room: Multilingual vision-and-language navigation with dense spatiotemporal grounding.arXiv preprint arXiv:2010.07954, 2020

Alexander Ku, Peter Anderson, Roma Patel, Eu- gene Ie, and Jason Baldridge. Room-across-room: Multilingual vision-and-language navigation with dense spatiotemporal grounding.arXiv preprint arXiv:2010.07954, 2020

arXiv 2010

[4] [4]

Waypoint mod- els for instruction-guided navigation in continuous environments

Jacob Krantz, Aaron Gokaslan, Dhruv Batra, Stefan Lee, and Oleksandr Maksymets. Waypoint mod- els for instruction-guided navigation in continuous environments. InProceedings of the IEEE/CVF In- ternational Conference on Computer Vision, pages 15162–15171, 2021

2021

[5] [5]

Bridging the gap between learning in discrete andcontinuousenvironmentsforvision-and-language navigation

Yicong Hong, Zun Wang, Qi Wu, and Stephen Gould. Bridging the gap between learning in discrete andcontinuousenvironmentsforvision-and-language navigation. InProceedings of the IEEE/CVF con- ference on computer vision and pattern recognition, pages 15439–15449, 2022

2022

[6] [6]

Learning navigational visual representations with semantic map supervision

Yicong Hong, Yang Zhou, Ruiyi Zhang, Franck Der- noncourt, Trung Bui, Stephen Gould, and Hao Tan. Learning navigational visual representations with semantic map supervision. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 3055–3067, 2023

2023

[7] [7]

Gridmm: Grid memory map for vision-and-language navigation

Zihan Wang, Xiangyang Li, Jiahao Yang, Yeqi Liu, and Shuqiang Jiang. Gridmm: Grid memory map for vision-and-language navigation. InProceedings of the IEEE/CVF International conference on computer vision, pages 15625–15636, 2023

2023

[8] [8]

Dreamwalker: Mental planning for continuous vision-language navigation

Hanqing Wang, Wei Liang, Luc Van Gool, and Wen- guan Wang. Dreamwalker: Mental planning for continuous vision-language navigation. InProceed- ings of the IEEE/CVF international conference on computer vision, pages 10873–10883, 2023

2023

[9] [9]

Mapnav: A novel memory representation via annotated semantic maps for vlm-based vision-and- language navigation

Lingfeng Zhang, Xiaoshuai Hao, Qinwen Xu, Qiang Zhang, Xinyao Zhang, Pengwei Wang, Jing Zhang, Zhongyuan Wang, Shanghang Zhang, and Renjing Xu. Mapnav: A novel memory representation via annotated semantic maps for vlm-based vision-and- language navigation. InProceedings of the 63rd Annual Meeting of the Association for Computational 10 Linguistics (Volume...

2025

[10] [10]

Toponav: Topological graphs as a key enabler for advanced object navigation

Peng Liu, Qiang Zhang, Di Peng, Lingfeng Zhang, Yiran Qin, Huan Zhou, Jiajun Ma, Renjing Xu, and Yandong Ji. Toponav: Topological graphs as a key enabler for advanced object navigation. InIEEE In- ternational Conference on Robotics and Automation, 2026

2026

[11] [11]

Trihelper: Zero-shot object navigation with dynamic assistance

Lingfeng Zhang, Qiang Zhang, Hao Wang, Erjia Xiao, Zixuan Jiang, Honglei Chen, and Renjing Xu. Trihelper: Zero-shot object navigation with dynamic assistance. InIEEE/RSJ International Conference on Intelligent Robots and Systems, 2024. URLhttps: //arxiv.org/abs/2403.15223

arXiv 2024

[12] [12]

Multi-floor zero-shot object navigation policy

Lingfeng Zhang, Hanqing Wang, Enyu Xiao, Xinyao Zhang, Qiang Zhang, Zihan Jiang, and Renjing Xu. Multi-floor zero-shot object navigation policy. In IEEE International Conference on Robotics and Au- tomation, 2025

2025

[13] [13]

Stairway to success: An online floor- awarezero-shotobject-goalnavigationframeworkvia llm-driven coarse-to-fine exploration.IEEE Robotics and Automation Letters, 2026

Zeying Gong, Rui Li, Tianyu Hu, Ruofei Qiu, Linghe Kong, Lingfeng Zhang, Guoqing Zhao, Yu Ding, and Junwei Liang. Stairway to success: An online floor- awarezero-shotobject-goalnavigationframeworkvia llm-driven coarse-to-fine exploration.IEEE Robotics and Automation Letters, 2026

2026

[14] [14]

Socialnav-map: Dynamic mapping with human tra- jectory prediction for zero-shot social navigation

Lingfeng Zhang, Erjia Xiao, Xiaoshuai Hao, Haox- iang Fu, Zeying Gong, Long Chen, Xiaojun Liang, Renjing Xu, Hangjun Ye, and Wenbo Ding. Socialnav-map: Dynamic mapping with human tra- jectory prediction for zero-shot social navigation. arXiv preprint arXiv:2511.12232, 2025. URLhttps: //arxiv.org/abs/2511.12232

arXiv 2025

[15] [15]

Walk with me: Long- horizon social navigation for human-centric outdoor assistance.arXiv preprint arXiv:2604.26839, 2026

Lingfeng Zhang, Xiaoshuai Hao, Xinyu Bu, Yingbo Tang, Haoran Li, Jian Lu, Xinyu Wei, Jiajun Ma, Yichen Liu, Jing Zhang, et al. Walk with me: Long- horizon social navigation for human-centric outdoor assistance.arXiv preprint arXiv:2604.26839, 2026. URLhttps://arxiv.org/abs/2604.26839

Pith/arXiv arXiv 2026

[16] [16]

The robosense challenge: Sense anything, navigate anywhere, adapt across platforms.arXiv preprint, 2026

Linghe Kong, Sicheng Xie, Zeying Gong, Yuxiang Li, Min Chu, An Liang, Yuhang Dong, Tianyu Hu, Ruofei Qiu, Rui Li, et al. The robosense challenge: Sense anything, navigate anywhere, adapt across platforms.arXiv preprint, 2026

2026

[17] [17]

Team xiaomi ev-ad vla: Caption- guided retrieval system for cross-modal drone nav- igation – technical report for iros 2025 robosense challenge track 4

Lingfeng Zhang, Enyu Xiao, Yifan Zhang, Haoxiang Fu, Rui Hu, Yuchen Ma, Wenbo Ding, Long Chen, Hang Ye, et al. Team xiaomi ev-ad vla: Caption- guided retrieval system for cross-modal drone nav- igation – technical report for iros 2025 robosense challenge track 4. Technical report, IROS 2025 Ro- boSense Challenge, 2025

2025

[18] [18]

Learning to navigate socially through proactive risk perception – technical report for iros 2025 robosense challenge social navigation track

Enyu Xiao, Lingfeng Zhang, Yingbo Tang, Haoran Cheng, Renjing Xu, Wenbo Ding, Li Zhou, Long Chen, Hang Ye, et al. Learning to navigate socially through proactive risk perception – technical report for iros 2025 robosense challenge social navigation track. Technical report, IROS 2025 RoboSense Chal- lenge, 2025

2025

[19] [19]

Mapfusion: A novel bev feature fusion network for multi-modal map construc- tion.Information Fusion, 119:103018, 2025

Xiaoshuai Hao, Yunfeng Diao, Mengchuan Wei, Yi- fan Yang, Peng Hao, Rong Yin, Hui Zhang, Weiming Li, Shu Zhao, and Yu Liu. Mapfusion: A novel bev feature fusion network for multi-modal map construc- tion.Information Fusion, 119:103018, 2025

2025

[20] [20]

Synergistic prompting for comple- mentarity and consistency in incomplete multi-view clustering.IEEE Transactions on Image Processing, 2026

Xiaoshuai Hao, Zhihui Zhang, Yingbo Tang, Lingfeng Zhang, Peng Hao, Yunfeng Diao, Guangyin Jin, and Yu Liu. Synergistic prompting for comple- mentarity and consistency in incomplete multi-view clustering.IEEE Transactions on Image Processing, 2026

2026

[21] [21]

Embodied spatial affordance: spatial-aware affordance learning for embodied nav- igation and manipulation.IEEE Transactions on Image Processing, 2026

Xiaoshuai Hao, Yingbo Tang, Lingfeng Zhang, Long Chen, Wei Zhou, Jungong Han, Wenbo Ding, and Xiao-Ping Zhang. Embodied spatial affordance: spatial-aware affordance learning for embodied nav- igation and manipulation.IEEE Transactions on Image Processing, 2026

2026

[22] [22]

Embodiedplan- 1k: A benchmark for complex navigation- manipulation task planning

Lingfeng Zhang, Yingbo Tang, Xinyu Zheng, Liang Li, Jinglin Xu, and Xiaoshuai Hao. Embodiedplan- 1k: A benchmark for complex navigation- manipulation task planning. 2026

2026

[23] [23]

Navid: Video-based vlm plans the next step for vision-and-language nav- igation.arXiv preprint arXiv:2402.15852, 2024

Jiazhao Zhang, Kunyu Wang, Rongtao Xu, Gengze Zhou, Yicong Hong, Xiaomeng Fang, Qi Wu, Zhizheng Zhang, and He Wang. Navid: Video-based vlm plans the next step for vision-and-language nav- igation.arXiv preprint arXiv:2402.15852, 2024

Pith/arXiv arXiv 2024

[24] [24]

Uni-navid: A video-based vision-language-action model for uni- fying embodied navigation tasks.arXiv preprint arXiv:2412.06224, 2024

Jiazhao Zhang, Kunyu Wang, Shaoan Wang, Ming- han Li, Haoran Liu, Songlin Wei, Zhongyuan Wang, Zhizheng Zhang, and He Wang. Uni-navid: A video-based vision-language-action model for uni- fying embodied navigation tasks.arXiv preprint arXiv:2412.06224, 2024

Pith/arXiv arXiv 2024

[25] [25]

Navila: Legged robot vision-language-action model for navigation.arXiv preprint arXiv:2412.04453, 2024

An-Chieh Cheng, Yandong Ji, Zhaojing Yang, Zaitian Gongye, Xueyan Zou, Jan Kautz, Erdem Bıyık, Hongxu Yin, Sifei Liu, and Xiaolong Wang. Navila: Legged robot vision-language-action model for navigation.arXiv preprint arXiv:2412.04453, 2024

arXiv 2024

[26] [26]

Streamvln: Streaming vision-and-language naviga- tion via slowfast context modeling.arXiv preprint arXiv:2507.05240, 2025

Meng Wei, Chenyang Wan, Xiqian Yu, Tai Wang, Yuqiang Yang, Xiaohan Mao, Chenming Zhu, Wenzhe Cai, Hanqing Wang, Yilun Chen, et al. Streamvln: Streaming vision-and-language naviga- tion via slowfast context modeling.arXiv preprint arXiv:2507.05240, 2025

arXiv 2025

[27] [27]

Embodied Navigation Foundation Model

Jiazhao Zhang, Anqi Li, Yunpeng Qi, Minghan Li, Jiahang Liu, Shaoan Wang, Haoran Liu, Gengze Zhou, Yuze Wu, Xingxing Li, Yuxin Fan, Wenjun Li, Zhibo Chen, Fei Gao, Qi Wu, Zhizheng Zhang, and He Wang. Embodied Navigation Foundation Model. InInternational Conference on Learning Representations, 2026. URL https://openreview. net/forum?id=kkBOIsrCXh. 11

2026

[28] [29]

URLhttps://arxiv.org/abs/2508.04598

arXiv

[29] [30]

Robobrain 2.0 technical report.arXiv preprint arXiv:2507.02029, 2025

BAAI RoboBrain Team, Mingyu Cao, Huajie Tan, Yuheng Ji, Xiansheng Chen, Minglan Lin, Zhiyu Li, Zhou Cao, Pengwei Wang, Enshen Zhou, et al. Robobrain 2.0 technical report.arXiv preprint arXiv:2507.02029, 2025. URL https://arxiv.org/ abs/2507.02029

arXiv 2025

[30] [31]

Mimo-embodied: X- embodied foundation model technical report.arXiv preprint arXiv:2511.16518, 2025

Xiaoshuai Hao, Lei Zhou, Zhijian Huang, Zhiwen Hou, Yingbo Tang, Lingfeng Zhang, Guang Li, Zheng Lu, Shuhuai Ren, et al. Mimo-embodied: X- embodied foundation model technical report.arXiv preprint arXiv:2511.16518, 2025. URL https:// arxiv.org/abs/2511.16518

Pith/arXiv arXiv 2025

[31] [32]

Onevla: A uni- fied framework for embodied tasks.arXiv preprint arXiv:2606.01241, 2026

Lingfeng Zhang, Xiaoshuai Hao, Yingbo Tang, Lei Zhou, Shuyi Zhang, Jinkun Liu, Hongsheng Li, Chen- hao Zhang, Qiang Zhang, Hangjun Ye, Xiaojun Liang, Long Chen, and Wenbo Ding. Onevla: A uni- fied framework for embodied tasks.arXiv preprint arXiv:2606.01241, 2026. URL https://arxiv.org/ abs/2606.01241

Pith/arXiv arXiv 2026

[32] [33]

Onevl: One-step latent reasoning and planning with vision-language explanation.arXiv preprint arXiv:2604.18486, 2026

Jian Lu, Jian Guan, Zhaorui Huang, Jiacheng Li, Guoqiang Li, Linghe Kong, Yuxiang Li, Hao- ran Wang, Shiyu Xu, Yifan Luo, Fei Li, et al. Onevl: One-step latent reasoning and planning with vision-language explanation.arXiv preprint arXiv:2604.18486, 2026. URL https://arxiv.org/ abs/2604.18486

Pith/arXiv arXiv 2026

[33] [34]

Video-cot: A com- prehensive dataset for spatiotemporal understanding of videos based on chain-of-thought

Shilong Zhang, Xiaoshuai Hao, Yingbo Tang, Lingfeng Zhang, Pengwei Wang, Zhongyuan Wang, Huan Ma, and Shanghang Zhang. Video-cot: A com- prehensive dataset for spatiotemporal understanding of videos based on chain-of-thought. InProceedings of the ACM International Conference on Multimedia, 2025

2025

[34] [35]

Is your vlm sky-ready? a comprehensive spatial intelligence benchmark for uav navigation

Lingfeng Zhang, Yifan Zhang, Haoran Li, Haoxiang Fu, Yingbo Tang, Hang Ye, Long Chen, Xiaojun Liang, Xiaoshuai Hao, et al. Is your vlm sky-ready? a comprehensive spatial intelligence benchmark for uav navigation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recog- nition, 2026

2026

[35] [37]

URLhttps://arxiv.org/abs/2503.09010

arXiv

[36] [38]

Roboafford: A dataset and benchmark for enhancing object and spatial affordance learning in robot manipulation

Yingbo Tang, Lingfeng Zhang, Shilong Zhang, Yifan Zhao, and Xiaoshuai Hao. Roboafford: A dataset and benchmark for enhancing object and spatial affordance learning in robot manipulation. InPro- ceedings of the ACM International Conference on Multimedia, 2025

2025

[37] [39]

Roboafford++: A gener- ative ai-enhanced dataset for multimodal affordance learning in robotic manipulation and navigation

Xiaoshuai Hao, Yingbo Tang, Lingfeng Zhang, Yuchen Ma, Yuhang Diao, Zihan Jia, Wenbo Ding, Hang Ye, and Long Chen. Roboafford++: A gener- ative ai-enhanced dataset for multimodal affordance learning in robotic manipulation and navigation. In IEEE/RSJ International Conference on Intelligent Robots and Systems Workshop on RoDGE, 2025

2025

[38] [40]

Evaluating gpt-4o’s embodied intelligence: A comprehensive empirical study

Yifan Wu, Haoran Lyu, Yingbo Tang, Lingfeng Zhang, Ziheng Zhang, Wenxuan Zhou, and Shibo Hao. Evaluating gpt-4o’s embodied intelligence: A comprehensive empirical study. Technical report, Technical Report, 2025

2025

[39] [41]

Exploring typographic visual prompts injection threats in cross-modality genera- tion models

Haoran Cheng, Enyu Xiao, Yujie Wang, Lingfeng Zhang, Kaidi Xu, Meng Sun, Xiaoshuai Hao, Jinjin Gu, and Renjing Xu. Exploring typographic visual prompts injection threats in cross-modality genera- tion models. InInternational Joint Conference on Artificial Intelligence Workshop on Deepfake Detec- tion, Localization and Interpretability, 2025

2025

[40] [42]

Vquala 2025 challenge on engagement prediction for short videos: Methods and results

Dong Li, Shuang Ma, Hang Hua, Wei Li, Jian Wang, Chengwei Zhou, Feng Guan, Xin Li, Zhi Yu, Yao Lu, et al. Vquala 2025 challenge on engagement prediction for short videos: Methods and results. In IEEE/CVF International Conference on Computer Vision Workshop, 2025

2025

[41] [43]

H2r-bm: Can leveraging human videos enhance per- formance and generalizability in robotic bimanual manipulation?Pattern Recognition, page 113637, 2026

Xiaoshuai Hao, Haoran Lyu, Lingfeng Zhang, Ruidong Liu, Di Wu, Jing Zhang, and Long Chen. H2r-bm: Can leveraging human videos enhance per- formance and generalizability in robotic bimanual manipulation?Pattern Recognition, page 113637, 2026

2026

[42] [44]

What you see is what you reach: Towards spatial navigation with high-level human instructions

Lingfeng Zhang, Haoxiang Fu, Xiaoshuai Hao, Shi- long Zhang, Qiang Zhang, Ruidong Liu, Long Chen, and Wenbo Ding. What you see is what you reach: Towards spatial navigation with high-level human instructions. InProceedings of the AAAI Conference on Artificial Intelligence, 2026

2026

[43] [45]

Mesh- mimic: Geometry-aware humanoid motion learning through 3d scene reconstruction.arXiv preprint arXiv:2602.15733, 2026

Qiang Zhang, Jiajun Ma, Peng Liu, Shiyu Shi, Zhiqiang Su, Zhongyuan Wang, Jiaming Sun, Wenx- uan Cui, Jia Yu, Guang Han, et al. Mesh- mimic: Geometry-aware humanoid motion learning through 3d scene reconstruction.arXiv preprint arXiv:2602.15733, 2026. URL https://arxiv.org/ abs/2602.15733

arXiv 2026

[44] [46]

JanusVLN: Decoupling Se- mantics and Spatiality with Dual Implicit Memory for Vision-Language Navigation

Shuang Zeng, Dekang Qi, Xinyuan Chang, Feng Xiong, Shichao Xie, Xiaolong Wu, Shiyi Liang, Mu Xu, and Xing Wei. JanusVLN: Decoupling Se- mantics and Spatiality with Dual Implicit Memory for Vision-Language Navigation. InInternational Conference on Learning Representations, 2026. URL https://openreview.net/forum?id=RnuB0Nlbd5. 12

2026

[45] [47]

Navigation world models

AmirBar, GaoyueZhou, DannyTran, TrevorDarrell, and Yann LeCun. Navigation world models. InPro- ceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 15791–15801,

[46] [48]

URLhttps://arxiv.org/abs/2412.03572

arXiv

[47] [49]

Worldvln: Autoregressive world action model for aerial vision-language navigation.arXiv preprint arXiv:2605.15964, 2026

Baining Zhao, Jiacheng Xu, Weicheng Feng, Xin Zhang, Zhaolu Wang, Haoyang Wang, Shilong Ji, Ziyou Wang, Jianjie Fang, Zhiheng Zheng, Weichen Zhang, Yu Shang, Wei Wu, Chen Gao, Xinlei Chen, and Yong Li. Worldvln: Autoregressive world action model for aerial vision-language navigation.arXiv preprint arXiv:2605.15964, 2026. URL https:// arxiv.org/abs/2605.15964

Pith/arXiv arXiv 2026

[48] [50]

Wam-nav: Asymmetric latent world-action model- ing for unified visual navigation.arXiv preprint arXiv:2606.04907, 2026

Ning Yang, Yan Huang, Kaiwen Peng, Ziheng He, Kai Wang, Cui Miao, Kailin Lyu, Guo Li, Xiaofeng Wang, Zheng Zhu, Jing Liu, and Nianfeng Liu. Wam-nav: Asymmetric latent world-action model- ing for unified visual navigation.arXiv preprint arXiv:2606.04907, 2026. URL https://arxiv.org/ abs/2606.04907

Pith/arXiv arXiv 2026

[49] [51]

Navforesee: A unified vision-language world model for hierarchi- cal planning and dual-horizon navigation prediction

Fei Liu, Shichao Xie, Minghua Luo, Zedong Chu, Junjun Hu, Xiaolong Wu, and Mu Xu. Navforesee: A unified vision-language world model for hierarchi- cal planning and dual-horizon navigation prediction. arXiv preprint arXiv:2512.01550, 2025

arXiv 2025

[50] [52]

Astranav-world: World model for foresight control and consistency.arXiv preprint arXiv:2512.21714, 2025

Junjun Hu, Jintao Chen, Haochen Bai, Minghua Luo, Shichao Xie, Ziyi Chen, Fei Liu, Zedong Chu, Xinda Xue, Botao Ren, et al. Astranav-world: World model for foresight control and consistency.arXiv preprint arXiv:2512.21714, 2025

Pith/arXiv arXiv 2025

[51] [53]

Nav- morph: A self-evolving world model for vision-and- language navigation in continuous environments

Xuan Yao, Junyu Gao, and Changsheng Xu. Nav- morph: A self-evolving world model for vision-and- language navigation in continuous environments. arXiv preprint arXiv:2506.23468, 2025

arXiv 2025

[52] [54]

Oa-wam: Object- addressable world action model for robust robot ma- nipulation.arXiv preprint arXiv:2605.06481, 2026

Yichen Liu, Peng Sun, Shuo Li, Yuxuan Xie, Lingfeng Zhang, Xingyu Chao, Siyuan Dong, Fei Chen, Xiaoping Zhang, et al. Oa-wam: Object- addressable world action model for robust robot ma- nipulation.arXiv preprint arXiv:2605.06481, 2026. URLhttps://arxiv.org/abs/2605.06481

Pith/arXiv arXiv 2026

[53] [55]

Thinking in text and im- ages: Interleaved vision-language reasoning traces for long-horizon robot manipulation.arXiv preprint arXiv:2605.00438, 2026

Jiahao Liu, Haoran Chi, Lingfeng Zhang, Yuxuan Xie, Yanan Wang, Long Chen, Hang Ye, Xiaoshuai Hao, and Wenbo Ding. Thinking in text and im- ages: Interleaved vision-language reasoning traces for long-horizon robot manipulation.arXiv preprint arXiv:2605.00438, 2026. URL https://arxiv.org/ abs/2605.00438

Pith/arXiv arXiv 2026

[54] [56]

Reasoning emerges from constrained inference manifolds in large language models.arXiv preprint arXiv:2605.08142, 2026

Yuchen Ma, Fei Luo, Lingfeng Zhang, Chen Zhao, Ming Wang, Yifan Wu, Ziyu Qian, Yao Lu, Long Chen, et al. Reasoning emerges from constrained inference manifolds in large language models.arXiv preprint arXiv:2605.08142, 2026. URL https:// arxiv.org/abs/2605.08142

Pith/arXiv arXiv 2026

[55] [57]

Sef-map: Subspace-decomposed expert fusion for robust multimodal hd map predic- tion

Haoxiang Fu, Lingfeng Zhang, Haoran Li, Rui Hu, Zi- han Li, Guoqing Liu, Zeyu Tan, Long Chen, Hang Ye, and Xiaoshuai Hao. Sef-map: Subspace-decomposed expert fusion for robust multimodal hd map predic- tion. InIEEE International Conference on Robotics and Automation, 2026

2026

[56] [59]

URLhttps://arxiv.org/abs/2604.05405

Pith/arXiv arXiv

[57] [60]

Pathdreamer: A world model for indoor navigation.arXiv preprint arXiv:2105.08756, 2021

Jing Yu Koh, Honglak Lee, Yinfei Yang, Jason Baldridge, and Peter Anderson. Pathdreamer: A world model for indoor navigation.arXiv preprint arXiv:2105.08756, 2021. URL https://arxiv.org/ abs/2105.08756

arXiv 2021

[58] [61]

Vistav2: World imagination for indoor vision-and-language naviga- tion.arXiv preprint arXiv:2512.00041, 2025

Yanjia Huang, Xianshun Jiang, Xiangbo Gao, Mingyang Wu, and Zhengzhong Tu. Vistav2: World imagination for indoor vision-and-language naviga- tion.arXiv preprint arXiv:2512.00041, 2025. URL https://arxiv.org/abs/2512.00041

arXiv 2025

[59] [62]

Navcot: Boosting llm- based vision-and-language navigation via learning disentangled reasoning.IEEE Transactions on Pat- tern Analysis and Machine Intelligence, 2025

Bingqian Lin, Yunshuang Nie, Ziming Wei, Jiaqi Chen, Shikui Ma, Jianhua Han, Hang Xu, Xiaojun Chang, and Xiaodan Liang. Navcot: Boosting llm- based vision-and-language navigation via learning disentangled reasoning.IEEE Transactions on Pat- tern Analysis and Machine Intelligence, 2025. URL https://arxiv.org/abs/2403.07376

arXiv 2025

[60] [63]

Monodream: Monocular vision-language nav- igation with panoramic dreaming.arXiv preprint arXiv:2508.02549, 2025

Shuo Wang, Yongcai Wang, Wanting Li, Yucheng Wang, Maiyue Chen, Kaihui Wang, Zhizhong Su, Xudong Cai, Yeying Jin, Deying Li, and Zhaoxin Fan. Monodream: Monocular vision-language nav- igation with panoramic dreaming.arXiv preprint arXiv:2508.02549, 2025. URL https://arxiv.org/ abs/2508.02549

arXiv 2025

[61] [64]

Fast-wam: Do world action models need test-time future imagination?arXiv preprint arXiv:2603.16666, 2026

Tianyuan Yuan, Zibin Dong, Yicheng Liu, and Hang Zhao. Fast-wam: Do world action models need test-time future imagination?arXiv preprint arXiv:2603.16666, 2026

Pith/arXiv arXiv 2026

[62] [65]

Spaact: Spatially-activated transition learning with curriculum adaptation for vision-language navigation. arXiv preprint arXiv:2604.27620, 2026

Pengna Li, Kangyi Wu, Shaoqing Xu, Fang Li, Hanbing Li, Lin Zhao, Kailin Lyu, Long Chen, Zhi-Xin Yang, and Nanning Zheng. Spaact: Spatially-activated transition learning with curriculum adaptation for vision-language navigation. arXiv preprint arXiv:2604.27620, 2026

Pith/arXiv arXiv 2026

[63] [66]

Vggt: Visual geometry grounded transformer

Jianyuan Wang, Minghao Chen, Nikita Karaev, Andrea Vedaldi, Christian Rupprecht, and David Novotny. Vggt: Visual geometry grounded transformer. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 5294–5306, 2025

2025

[64] [67]

Sim-2-sim transfer for vision-and-language navigation in continuous environments

Jacob Krantz and Stefan Lee. Sim-2-sim transfer for vision-and-language navigation in continuous environments. In European conference on computer vision, pages 588–603. Springer, 2022.

2022

[65] [68]

1st place solutions for rxr-habitat vision-and-language navigation competition (cvpr 2022). arXiv preprint arXiv:2206.11610, 2022

Dong An, Zun Wang, Yangguang Li, Yi Wang, Yicong Hong, Yan Huang, Liang Wang, and Jing Shao. 1st place solutions for rxr-habitat vision-and-language navigation competition (cvpr 2022). arXiv preprint arXiv:2206.11610, 2022

arXiv 2022

[66] [69]

Instructnav: Zero-shot system for generic instruction navigation in unexplored environment. arXiv preprint arXiv:2406.04882, 2024

Yuxing Long, Wenzhe Cai, Hongcheng Wang, Guanqi Zhan, and Hao Dong. Instructnav: Zero-shot system for generic instruction navigation in unexplored environment. arXiv preprint arXiv:2406.04882, 2024

arXiv 2024

[67] [70]

Cosmo: Combination of selective memorization for low-cost vision-and-language navigation

Siqi Zhang, Yanyuan Qiao, Qunbo Wang, Zike Yan, Qi Wu, Zhihua Wei, and Jing Liu. Cosmo: Combination of selective memorization for low-cost vision-and-language navigation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 5511–5522, 2025

2025

[68] [71]

Affordances-oriented planning using foundation models for continuous vision-language navigation

Jiaqi Chen, Bingqian Lin, Xinmin Liu, Lin Ma, Xiaodan Liang, and Kwan-Yee K Wong. Affordances-oriented planning using foundation models for continuous vision-language navigation. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 39, pages 23568–23576, 2025

2025

[69] [72]

Language-aligned waypoint (law) supervision for vision-and-language navigation in continuous environments

Sonia Raychaudhuri, Saim Wani, Shivansh Patel, Unnat Jain, and Angel Chang. Language-aligned waypoint (law) supervision for vision-and-language navigation in continuous environments. In Proceedings of the 2021 conference on empirical methods in natural language processing, pages 4018–4028, 2021

2021

[70] [73]

g3d-lf: Generalizable 3d-language feature fields for embodied tasks

Zihan Wang and Gim Hee Lee. g3d-lf: Generalizable 3d-language feature fields for embodied tasks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14191–14202, 2025

2025

[71] [74]

Navid-4d: Unleashing spatial intelligence in egocentric rgb-d videos for vision-and-language navigation

Haoran Liu, Weikang Wan, Xiqian Yu, Minghan Li, Jiazhao Zhang, Bo Zhao, Zhibo Chen, Zhongyuan Wang, Zhizheng Zhang, and He Wang. Navid-4d: Unleashing spatial intelligence in egocentric rgb-d videos for vision-and-language navigation. In 2025 IEEE International Conference on Robotics and Automation (ICRA), pages 10607–10615. IEEE, 2025

2025

[72] [75]

Sim-to-real transfer via 3d feature fields for vision-and-language navigation. arXiv preprint arXiv:2406.09798, 2024

Zihan Wang, Xiangyang Li, Jiahao Yang, Yeqi Liu, and Shuqiang Jiang. Sim-to-real transfer via 3d feature fields for vision-and-language navigation. arXiv preprint arXiv:2406.09798, 2024

arXiv 2024

[73] [76]

Ground slow, move fast: A dual-system foundation model for generalizable vision-and-language navigation. arXiv preprint arXiv:2512.08186, 2025

Meng Wei, Chenyang Wan, Jiaqi Peng, Xiqian Yu, Yuqiang Yang, Delin Feng, Wenzhe Cai, Chenming Zhu, Tai Wang, Jiangmiao Pang, et al. Ground slow, move fast: A dual-system foundation model for generalizable vision-and-language navigation. arXiv preprint arXiv:2512.08186, 2025

arXiv 2025

[74] [77]

Habitat: A platform for embodied ai research

Manolis Savva, Abhishek Kadian, Oleksandr Maksymets, Yili Zhao, Erik Wijmans, Bhavana Jain, Julian Straub, Jia Liu, Vladlen Koltun, Jitendra Malik, et al. Habitat: A platform for embodied ai research. In Proceedings of the IEEE/CVF international conference on computer vision, pages 9339–9347, 2019

2019

[75] [78]

Matterport3d: Learning from rgb-d data in indoor environments. arXiv preprint arXiv:1709.06158, 2017

Angel Chang, Angela Dai, Thomas Funkhouser, Maciej Halber, Matthias Niessner, Manolis Savva, Shuran Song, Andy Zeng, and Yinda Zhang. Matterport3d: Learning from rgb-d data in indoor environments. arXiv preprint arXiv:1709.06158, 2017

Pith/arXiv arXiv 2017

[76] [79]

Qwen3-vl technical report, 2025

Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, Wenbin Ge, Zhifang Guo, Qidong Huang, Jie Huang, Fei Huang, Binyuan Hui, Shutong Jiang, Zhaohai Li, Mingsheng Li, Mei Li, Kaixin Li, Zicheng Lin, Junyang Lin, Xuejing Liu, Jiawei Liu, Chenglong Liu, Yang Liu, Dayiheng Liu, Sh...

Pith/arXiv arXiv 2025

[77] [80]

Scaling data generation in vision-and-language navigation

Zun Wang, Jialu Li, Yicong Hong, Yi Wang, Qi Wu, Mohit Bansal, Stephen Gould, Hao Tan, and Yu Qiao. Scaling data generation in vision-and-language navigation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 12009–12020, 2023

2023

[78] [81]

A reduction of imitation learning and structured prediction to no-regret online learning

Stéphane Ross, Geoffrey Gordon, and Drew Bagnell. A reduction of imitation learning and structured prediction to no-regret online learning. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pages 627–635. JMLR Workshop and Conference Proceedings, 2011

2011

[79] [82]

Auto-encoding variational bayes

Diederik P. Kingma and Max Welling. Auto-encoding variational bayes. In International Conference on Learning Representations, 2014

2014

[80] [83]

Emerging properties in self-supervised vision transformers

Mathilde Caron, Hugo Touvron, Ishan Misra, Hervé Jégou, Julien Mairal, Piotr Bojanowski, and Armand Joulin. Emerging properties in self-supervised vision transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 9650–9660, 2021.

2021