WorldVLN: Autoregressive World Action Model for Aerial Vision-Language Navigation
Pith reviewed 2026-05-20 17:40 UTC · model grok-4.3
The pith
Predicting short-horizon world-state transitions allows an autoregressive model to decode reliable waypoint actions for aerial vision-language navigation.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
WorldVLN formulates aerial VLN as a world-action problem where an autoregressive model predicts short-horizon world-state transitions from instruction-conditioned contexts and directly decodes these predictions into executable waypoint actions. Closed-loop feedback incorporates new observations after execution, and Action-aware GRPO optimizes the model by considering the downstream effects of waypoint choices during rollout.
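The predict, decode, execute, re-encode cycle described above can be sketched as a toy loop. This is a minimal illustration and not the authors' implementation: `ToyEnv`, `encode`, `predict_transition`, and `decode_waypoints` are hypothetical stand-ins, a goal position replaces the language instruction, and the "latent state" is simply a 3D position.

```python
from dataclasses import dataclass
from typing import List, Tuple

Vec3 = Tuple[float, float, float]


@dataclass
class ToyEnv:
    """Hypothetical simulator: the observation is just the drone position."""
    position: Vec3 = (0.0, 0.0, 0.0)

    def observe(self) -> Vec3:
        return self.position

    def execute(self, waypoints: List[Vec3]) -> None:
        # Fly the waypoint segment and stop at its last waypoint.
        self.position = waypoints[-1]


def encode(observation: Vec3) -> Vec3:
    # Placeholder encoder (identity); a real system would produce latents.
    return observation


def predict_transition(context: List[Vec3], goal: Vec3, horizon: int) -> List[Vec3]:
    # Placeholder for short-horizon prediction: each predicted state moves
    # halfway from the current state toward the goal.
    current = context[-1]
    states = []
    for _ in range(horizon):
        current = tuple(c + 0.5 * (g - c) for c, g in zip(current, goal))
        states.append(current)
    return states


def decode_waypoints(predicted: List[Vec3]) -> List[Vec3]:
    # Placeholder decoder: predicted states are used directly as waypoints.
    return list(predicted)


def navigate(env: ToyEnv, goal: Vec3, horizon: int = 2,
             max_segments: int = 10, tol: float = 0.1) -> int:
    """Run the closed loop and return the number of executed segments."""
    context = [encode(env.observe())]
    for segment in range(1, max_segments + 1):
        predicted = predict_transition(context, goal, horizon)
        env.execute(decode_waypoints(predicted))
        # Closed-loop step: the real observation, not the prediction,
        # is appended to the autoregressive context.
        context.append(encode(env.observe()))
        if max(abs(p - g) for p, g in zip(env.position, goal)) < tol:
            return segment
    return max_segments


if __name__ == "__main__":
    env = ToyEnv()
    print(navigate(env, goal=(8.0, 0.0, 4.0)), env.position)
```

The point of the sketch is the ordering: only a short horizon is predicted per step, and the context is refreshed from what the environment actually returned, so prediction errors do not accumulate across segments.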
What carries the argument
Short-horizon latent world-state transition prediction in an adapted autoregressive video backbone, decoded directly to waypoint actions with closed-loop observation updates.
If this is right
- Outperforms vision-language-action baselines by 12% or more in success rate on outdoor and indoor public benchmarks.
- Shows larger gains on difficult navigation cases.
- Achieves zero-shot transfer to real drone deployment without additional engineering.
- Provides a route for other spatial action tasks using prediction-driven world models.
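Because the headline figure is a success-rate gain, it helps to be explicit about what is being computed. The sketch below assumes the common VLN convention that an episode succeeds when the agent stops within a distance threshold of the goal; the threshold and the per-episode distances are made-up placeholders, not numbers from the paper. It also separates an absolute gain in percentage points from a relative gain, a distinction the summary above does not make.

```python
from typing import Sequence


def success_rate(final_distances_m: Sequence[float], threshold_m: float) -> float:
    """Fraction of episodes that end within threshold_m of the goal."""
    if not final_distances_m:
        raise ValueError("need at least one episode")
    successes = sum(1 for d in final_distances_m if d <= threshold_m)
    return successes / len(final_distances_m)


def absolute_gain_points(sr_model: float, sr_baseline: float) -> float:
    """Gain in percentage points, e.g. 0.30 -> 0.50 is 20 points."""
    return 100.0 * (sr_model - sr_baseline)


def relative_gain_percent(sr_model: float, sr_baseline: float) -> float:
    """Gain relative to the baseline rate, in percent."""
    return 100.0 * (sr_model - sr_baseline) / sr_baseline


if __name__ == "__main__":
    # Placeholder final distances to goal (metres) for ten episodes.
    baseline = [5, 30, 12, 40, 25, 50, 8, 60, 35, 45]
    model = [5, 15, 12, 40, 18, 50, 8, 60, 35, 45]
    sr_b = success_rate(baseline, threshold_m=20.0)
    sr_m = success_rate(model, threshold_m=20.0)
    print(sr_b, sr_m)
    print(round(absolute_gain_points(sr_m, sr_b), 2))
    print(round(relative_gain_percent(sr_m, sr_b), 2))
```

On the placeholder data the same improvement reads as 20 points in absolute terms but roughly 67% in relative terms, which is why a reported "12%" should state which of the two it is.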
Where Pith is reading between the lines
- This approach might generalize to ground-based or underwater navigation by swapping the video backbone for domain-appropriate priors.
- Future work could test if longer horizons or multi-step planning within the autoregressive structure further improves complex multi-room instructions.
- Integrating this with other sensor types like depth or LiDAR could enhance robustness in low-visibility conditions.
Load-bearing premise
Short-horizon predictions of world change from the video model can be decoded into waypoint actions whose real-world consequences improve navigation success, without full video generation or heavy customization.
What would settle it
The claim would be undermined if the trained model, evaluated on the benchmark environments, reached a success rate no higher than that of standard vision-language-action methods, or if it frequently failed to follow instructions because of inaccurate transition predictions.
Original abstract
Aerial vision-language navigation (VLN) requires agents to follow natural-language instructions through closed-loop perception and action in 3D environments. We argue that aerial VLN can be formulated as a prediction-driven world-action problem: the agent should anticipate latent world evolution and act according to the predicted consequences. To this end, we propose WorldVLN, the first autoregressive world action model for aerial VLN. Unlike full-sequence video-generation world models that generate an entire visual clip, WorldVLN adapts a latent autoregressive video backbone to predict short-horizon world-state transitions and directly decodes them into executable waypoint actions. After each action segment is executed, newly received observations are encoded back into the autoregressive context, enabling closed-loop world-action prediction. We further introduce a two-stage training framework that first grounds the video prior in instruction-conditioned navigation dynamics and then develops Action-aware GRPO, the first reinforcement learning method tailored to autoregressive WAMs, to optimize waypoint decisions through their downstream rollout consequences. On public outdoor and indoor benchmarks, WorldVLN consistently outperforms existing Vision-Language-Action baselines with 12%+ success-rate gains and larger advantages on challenging cases. It further transfers zero-shot to real drone deployment, suggesting that the proposed WorldVLN offers a promising route for spatial action tasks. Demos and code are available at https://embodiedcity.github.io/WorldVLN/.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces WorldVLN as the first autoregressive world action model for aerial VLN. It adapts a latent autoregressive video backbone to predict short-horizon world-state transitions and directly decodes them into executable waypoint actions, with closed-loop feedback by re-encoding new observations after each action segment. A two-stage training framework first grounds the video prior in instruction-conditioned navigation dynamics and then applies the novel Action-aware GRPO reinforcement learning method to optimize waypoint decisions via downstream rollout consequences. The approach is reported to deliver 12%+ success-rate gains over Vision-Language-Action baselines on public outdoor and indoor benchmarks (with larger margins on challenging cases) and zero-shot transfer to real drone deployment.
Significance. If the central claims are substantiated by rigorous experiments, the work would be significant for embodied AI and robotics. It offers a predictive, short-horizon alternative to full-sequence video world models and standard VLA policies for 3D aerial navigation, while introducing Action-aware GRPO as a tailored RL objective for autoregressive world-action models. The reported zero-shot sim-to-real transfer on a physical drone would strengthen the case for practical utility in spatial reasoning tasks.
major comments (2)
- [§3.2] §3.2 (latent-to-waypoint decoder): The central claim requires that short-horizon latent transitions from the adapted video backbone can be directly decoded into reliable 3D waypoint actions whose rollout improves success rate. No equation or architectural diagram specifies the decoder, and no ablation isolates its contribution from the subsequent Action-aware GRPO stage. Without this, it remains possible that gains are driven primarily by the RL optimization rather than the world-model component.
- [§5] §5 (Experiments): The reported 12%+ success-rate gains and zero-shot real-drone transfer are load-bearing for the paper's conclusions, yet the manuscript supplies no error bars, statistical tests, baseline implementation details, or ablations comparing short-horizon versus full-sequence prediction. This absence prevents assessment of whether the data support the claim that short-horizon latent predictions suffice for reliable aerial waypoint control.
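To make concrete what kind of equation the first major comment asks for, one possible formalization of the pipeline is sketched here. It is an assumption built only from the prose description, not the manuscript's own formulation: every symbol ($E$, $p_\theta$, $g_\phi$, $H$, $I$) is introduced for illustration.

```latex
% Assumed notation, not taken from the manuscript:
%   I          language instruction
%   o_t        observation at step t
%   z_t        latent world state, z_t = E(o_t)
%   H          short prediction horizon
%   a_t        waypoint segment executed at step t
\begin{align}
  \hat{z}_{t+1:t+H} &\sim p_\theta\!\left(\,\cdot \mid z_{\le t},\, I\right)
      && \text{(autoregressive latent transition)} \\
  a_t &= g_\phi\!\left(\hat{z}_{t+1:t+H}\right)
      && \text{(latent-to-waypoint decoder)} \\
  z_{t+1} &= E\!\left(o_{t+1}\right), \quad o_{t+1} \text{ observed after executing } a_t
      && \text{(closed-loop re-encoding)}
\end{align}
```

Under this reading, the requested ablation would amount to keeping $p_\theta$ and the RL stage fixed while replacing $g_\phi$ with a decoder that does not see $\hat{z}_{t+1:t+H}$, which would isolate how much the predicted transitions contribute.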
minor comments (2)
- [Abstract] The abstract states that 'Demos and code are available' but the manuscript should include the exact repository link or a persistent identifier in the main text for reproducibility.
- [§4] Notation for Action-aware GRPO could be clarified with a short algorithm box or pseudocode to distinguish it from standard GRPO.
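As a baseline for the comparison the last comment asks for, the group-relative advantage used by standard GRPO (introduced in the DeepSeekMath paper) can be written in a few lines. The text above does not say how Action-aware GRPO changes this step, so only the standard form is shown; treating each reward as the return of one sampled waypoint rollout, and using the population standard deviation, are assumptions made for this sketch.

```python
from statistics import mean, pstdev
from typing import List, Sequence


def group_relative_advantages(rewards: Sequence[float], eps: float = 1e-8) -> List[float]:
    """Standard GRPO advantage: normalise each reward against its group.

    rewards: one scalar per rollout sampled from the same context
             (hypothetically, the return of one candidate waypoint rollout).
    eps:     small constant that avoids division by zero when all
             rewards in the group are equal.
    """
    if len(rewards) < 2:
        raise ValueError("a group needs at least two rollouts")
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    # Placeholder returns for four rollouts from the same context.
    print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))
```

Rollouts that beat the group average get a positive advantage and the rest a negative one, with no learned value function involved; a pseudocode box in the paper would need to show where the action-aware variant departs from this.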
Simulated Author's Rebuttal
We thank the referee for the constructive feedback, which has identified important opportunities to improve the clarity and rigor of our manuscript. We address each major comment below and commit to revisions that directly respond to the concerns raised.
- Referee: [§3.2] §3.2 (latent-to-waypoint decoder): The central claim requires that short-horizon latent transitions from the adapted video backbone can be directly decoded into reliable 3D waypoint actions whose rollout improves success rate. No equation or architectural diagram specifies the decoder, and no ablation isolates its contribution from the subsequent Action-aware GRPO stage. Without this, it remains possible that gains are driven primarily by the RL optimization rather than the world-model component.
  Authors: We agree that the description of the latent-to-waypoint decoder in §3.2 would benefit from greater precision. The current manuscript text introduces the decoder but does not provide an explicit equation or dedicated diagram. In the revised version we will add both: a mathematical formulation of the decoder mapping and an updated Figure 3 that visually separates the autoregressive latent prediction from the waypoint decoding step. We also acknowledge the absence of a targeted ablation isolating the decoder. While the two-stage training framework first grounds the video prior before applying GRPO, we did not report a controlled comparison that removes only the decoder. We will add this ablation in the revision to quantify the independent contribution of the world-model decoding to the reported gains.
  Revision: yes
- Referee: [§5] §5 (Experiments): The reported 12%+ success-rate gains and zero-shot real-drone transfer are load-bearing for the paper's conclusions, yet the manuscript supplies no error bars, statistical tests, baseline implementation details, or ablations comparing short-horizon versus full-sequence prediction. This absence prevents assessment of whether the data support the claim that short-horizon latent predictions suffice for reliable aerial waypoint control.
  Authors: We appreciate the referee's call for stronger statistical reporting and additional controls. The manuscript currently omits error bars, formal statistical tests, and the requested short-horizon versus full-sequence ablation. In the revised manuscript we will report standard deviations over multiple random seeds, include p-values from appropriate statistical tests, expand the baseline implementation details (including all hyperparameters and training protocols) in the main text or supplementary material, and add a new ablation that directly compares short-horizon latent prediction against full-sequence prediction under otherwise identical conditions. These additions will allow readers to evaluate whether the short-horizon design is sufficient for closed-loop aerial control.
  Revision: yes
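The statistical reporting promised in the second response could look like the following sketch. The rebuttal does not name a test, so the exact paired sign-flip permutation test used here is a choice made for illustration, and the per-seed success rates are placeholders, not results from the paper.

```python
from itertools import product
from statistics import mean, stdev
from typing import Sequence, Tuple


def summarize(values: Sequence[float]) -> Tuple[float, float]:
    """Mean and sample standard deviation across seeds."""
    return mean(values), stdev(values)


def paired_sign_flip_p_value(model: Sequence[float], baseline: Sequence[float]) -> float:
    """Exact two-sided paired permutation test on per-seed differences.

    Under the null hypothesis the sign of each paired difference is
    arbitrary, so every sign pattern is enumerated and the share with a
    mean difference at least as extreme as the observed one is returned.
    """
    if len(model) != len(baseline):
        raise ValueError("model and baseline need one value per seed")
    diffs = [m - b for m, b in zip(model, baseline)]
    observed = abs(mean(diffs))
    extreme = 0
    total = 0
    for signs in product((1, -1), repeat=len(diffs)):
        total += 1
        flipped = [s * d for s, d in zip(signs, diffs)]
        if abs(mean(flipped)) >= observed - 1e-12:
            extreme += 1
    return extreme / total


if __name__ == "__main__":
    # Placeholder success rates for five seeds; NOT results from the paper.
    model_sr = [0.52, 0.50, 0.55, 0.51, 0.53]
    baseline_sr = [0.40, 0.41, 0.39, 0.42, 0.38]
    print(summarize(model_sr), summarize(baseline_sr))
    print(paired_sign_flip_p_value(model_sr, baseline_sr))
```

With five seeds there are only 2^5 = 32 sign patterns, so the smallest two-sided p-value this test can return is 2/32 = 0.0625, even when the model wins on every seed. That is a practical reason for the revision to state the number of seeds alongside any p-value.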
Circularity Check
No significant circularity; derivation relies on empirical validation of novel components
The paper adapts an existing latent autoregressive video backbone to predict short-horizon world-state transitions, decodes them into waypoint actions, and introduces Action-aware GRPO for optimization via downstream rollouts. No equations or steps in the provided description reduce these predictions or gains to fitted inputs by construction, nor do they rely on self-citations that bear the central load without independent verification. The claims are supported by benchmark results and real-world transfer, making the chain self-contained against external evaluation rather than tautological.