arxiv: 2512.21714 · v2 · submitted 2025-12-25 · 💻 cs.CV

AstraNav-World: World Model for Foresight Control and Consistency

Pith reviewed 2026-05-16 19:21 UTC · model grok-4.3

classification 💻 cs.CV
keywords embodied navigation · world model · diffusion video generation · vision-language policy · foresight control · action-conditioned prediction · bidirectional consistency · zero-shot transfer

The pith

A unified world model couples future scene generation with action planning to produce executable navigation trajectories.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces AstraNav-World as an end-to-end world model for embodied navigation in open environments. It combines a diffusion-based video generator with a vision-language policy so that predicted scenes and planned actions update simultaneously under bidirectional constraints. Training jointly optimizes action-conditioned visual forecasts and visual-conditioned trajectory derivation. This coupling is intended to reduce cumulative errors that arise when prediction and planning run separately. The authors report higher trajectory accuracy, improved success rates on benchmarks, and zero-shot transfer to real-world settings without fine-tuning.

Core claim

AstraNav-World jointly reasons about future visual states and action sequences in a single probabilistic framework. It integrates a diffusion-based video generator with a vision-language policy to produce synchronized rollouts where action-conditioned multi-step visual predictions are generated alongside trajectories derived from those predictions. The bidirectional constraint is optimized directly, making the visual forecasts executable by the policy and keeping actions grounded in physically consistent futures.

What carries the argument

Bidirectional constraint enforced by joint optimization of action-conditioned video diffusion and visual-conditioned trajectory derivation within one generative model.
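
The joint optimization described here can be read as a weighted sum of two losses over shared parameters. The formulation below is a sketch under assumptions, not the paper's stated objective, since only the abstract is available. The noise-prediction form of the video loss, the squared-error trajectory loss, the weight \lambda, and every symbol are illustrative choices: o is the current observation, a_{1:H} an H-step action sequence, x^{(t)}_{1:H} the future frames noised to diffusion step t, \hat{x}_{1:H} the generator's predicted frames, and \epsilon_\theta and \pi_\theta the video and policy branches.

```latex
\begin{aligned}
\mathcal{L}_{\text{vis}}(\theta) &= \mathbb{E}_{x_{1:H},\, a_{1:H},\, o,\, \epsilon,\, t}
  \left[ \left\| \epsilon - \epsilon_\theta\!\left(x^{(t)}_{1:H},\, t,\, a_{1:H},\, o\right) \right\|_2^2 \right], \\
\mathcal{L}_{\text{act}}(\theta) &= \mathbb{E}_{x_{1:H},\, a_{1:H},\, o}
  \left[ \left\| a_{1:H} - \pi_\theta\!\left(o,\, \hat{x}_{1:H}\right) \right\|_2^2 \right], \\
\mathcal{L}(\theta) &= \mathcal{L}_{\text{vis}}(\theta) + \lambda\, \mathcal{L}_{\text{act}}(\theta).
\end{aligned}
```

In this assumed form, no term penalizes physically invalid frames directly; any consistency would have to emerge from training both terms on shared parameters.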

If this is right

  • Trajectory accuracy and task success rates increase on diverse embodied navigation benchmarks.
  • Cumulative drift from separate prediction-then-plan pipelines is reduced.
  • Zero-shot adaptation occurs in previously unseen real-world scenarios without additional training.
  • Visual predictions become directly executable by the policy rather than requiring post-hoc correction.
  • The model captures transferable spatial dynamics instead of simulation-specific patterns.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same joint-training pattern could be applied to other embodied tasks such as object manipulation where visual foresight must align with motor commands.
  • Removing the need for separate consistency modules may simplify deployment pipelines for real robots.
  • Testing the model on longer horizons or in environments with moving obstacles would reveal whether the bidirectional coupling scales beyond the reported benchmarks.

Load-bearing premise

Joint optimization of visual prediction and action derivation will automatically enforce physical consistency and executability without extra constraints or filtering.

What would settle it

Real-world trials in which the generated visual sequences diverge from actual camera observations or the derived actions produce collisions at rates comparable to decoupled baselines would show the bidirectional constraint does not deliver the claimed consistency.
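
The two quantities named in this test can be measured with very little machinery. The sketch below is illustrative only: the function names `rollout_divergence` and `collision_rate`, the choice of per-step mean squared error as the divergence measure, and the array layout are assumptions, not anything the paper specifies.

```python
import numpy as np


def rollout_divergence(predicted, observed):
    """Per-step mean squared error between predicted and observed frames.

    Hypothetical helper. Both inputs are arrays of shape (T, H, W, C)
    holding T frames with pixel values on the same scale.
    Returns an array of length T.
    """
    predicted = np.asarray(predicted, dtype=np.float64)
    observed = np.asarray(observed, dtype=np.float64)
    if predicted.shape != observed.shape:
        raise ValueError("predicted and observed must have the same shape")
    # Average the squared error over every axis except time.
    spatial_axes = tuple(range(1, predicted.ndim))
    return ((predicted - observed) ** 2).mean(axis=spatial_axes)


def collision_rate(collided):
    """Fraction of episodes flagged as containing at least one collision.

    Hypothetical helper. `collided` is a sequence of booleans, one per episode.
    """
    collided = np.asarray(collided, dtype=bool)
    if collided.size == 0:
        raise ValueError("need at least one episode")
    return float(collided.mean())
```

A rising divergence curve over the rollout horizon, or a collision rate close to that of a decoupled baseline evaluated on the same episodes, would be the kind of evidence this section describes.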

Original abstract

Embodied navigation in open, dynamic environments demands accurate foresight of how the world will evolve and how actions will unfold over time. We propose AstraNav-World, an end-to-end world model that jointly reasons about future visual states and action sequences within a unified probabilistic framework. Our framework integrates a diffusion-based video generator with a vision-language policy, enabling synchronized rollouts where predicted scenes and planned actions are updated simultaneously. Training optimizes two complementary objectives: generating action-conditioned multi-step visual predictions and deriving trajectories conditioned on those predicted visuals. This bidirectional constraint makes visual predictions executable and keeps decisions grounded in physically consistent, task-relevant futures, mitigating cumulative errors common in decoupled "envision-then-plan" pipelines. Experiments across diverse embodied navigation benchmarks show improved trajectory accuracy and higher success rates. Ablations confirm the necessity of tight vision-action coupling and unified training, with either branch removal degrading both prediction quality and policy reliability. In real-world testing, AstraNav-World demonstrated exceptional zero-shot capabilities, adapting to previously unseen scenarios without any real-world fine-tuning. These results suggest that AstraNav-World captures transferable spatial understanding and planning-relevant navigation dynamics, rather than merely overfitting to simulation-specific data distribution. Overall, by unifying foresight vision and control within a single generative model, we move closer to reliable, interpretable, and general-purpose embodied agents that operate robustly in open-ended real-world settings.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The paper introduces AstraNav-World, an end-to-end world model for embodied navigation in dynamic environments. It integrates a diffusion-based video generator with a vision-language policy in a unified probabilistic framework to produce synchronized rollouts of future visual states and action sequences. Training uses two complementary objectives—action-conditioned visual prediction and visual-conditioned trajectory derivation—to enforce bidirectional constraints that aim to make predictions executable and decisions physically consistent. The manuscript reports improved trajectory accuracy and success rates on benchmarks, ablations showing the necessity of tight coupling, and strong zero-shot real-world adaptation without fine-tuning.

Significance. If the empirical claims hold with rigorous metrics, the work would advance embodied AI by showing that joint optimization of foresight vision and control can reduce cumulative errors compared to decoupled pipelines, yielding more reliable and transferable navigation agents. The zero-shot real-world results, if substantiated, would indicate capture of general spatial dynamics rather than simulation overfitting, with potential implications for general-purpose embodied agents.

major comments (3)
  1. [Abstract / Experiments] The central empirical claims of 'improved trajectory accuracy and higher success rates' and 'ablations confirm the necessity' are presented without any quantitative metrics, baseline comparisons, error bars, statistical tests, or data-exclusion rules. This absence makes it impossible to assess the magnitude or reliability of the reported gains from bidirectional coupling.
  2. [Method / §3] The assertion that 'bidirectional constraint makes visual predictions executable and keeps decisions grounded in physically consistent futures' rests on joint optimization alone, without explicit physics-informed losses, consistency regularizers, or direct validation metrics (e.g., collision rates, object permanence, or momentum conservation in predicted scenes). Diffusion models are known to produce visually plausible but dynamically invalid sequences, so this assumption requires concrete evidence to support the load-bearing claim of automatic physical consistency.
  3. [Ablations] The statement that 'either branch removal degrading both prediction quality and policy reliability' is asserted without details on how the coupling was ablated (e.g., separate training schedules, loss weights, or architectural changes) or the resulting quantitative drops in the same metrics used for the main results.
minor comments (1)
  1. [Abstract] Phrases such as 'exceptional zero-shot capabilities' and 'robustly in open-ended real-world settings' are qualitative; replacing them with specific success rates, scenario descriptions, and comparison to baselines would improve precision.
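
The error bars and statistical tests requested in the first major comment could take many forms. As one hedged illustration, the sketch below computes a paired percentile-bootstrap confidence interval for the difference in success rate between two methods evaluated on the same episodes. The function name `paired_bootstrap_ci`, its parameters, and the choice of a percentile bootstrap are assumptions made for this example; the paper does not state which procedure, if any, it uses.

```python
import numpy as np


def paired_bootstrap_ci(success_a, success_b, n_resamples=10_000, alpha=0.05, seed=0):
    """Percentile-bootstrap CI for the success-rate difference (A minus B).

    Hypothetical helper. `success_a` and `success_b` are 1-D sequences of
    0/1 outcomes for the same episodes, in the same order.
    Returns (point_estimate, lower, upper).
    """
    a = np.asarray(success_a, dtype=np.float64)
    b = np.asarray(success_b, dtype=np.float64)
    if a.ndim != 1 or a.shape != b.shape or a.size == 0:
        raise ValueError("inputs must be non-empty 1-D arrays of equal length")
    rng = np.random.default_rng(seed)
    n = a.shape[0]
    # Resample episode indices with replacement; using the same indices for
    # both methods keeps the comparison paired.
    idx = rng.integers(0, n, size=(n_resamples, n))
    diffs = a[idx].mean(axis=1) - b[idx].mean(axis=1)
    lower, upper = np.quantile(diffs, [alpha / 2, 1 - alpha / 2])
    return float(a.mean() - b.mean()), float(lower), float(upper)
```

An interval that excludes zero would support a claim of improved success rate on the evaluated episodes; it would say nothing about episodes excluded from the evaluation, which is why the referee also asks for data-exclusion rules.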

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the detailed and constructive comments. We address each major point below and will revise the manuscript to provide the requested quantitative details, methodological clarifications, and additional validation metrics.

Point-by-point responses
  1. Referee: [Abstract / Experiments] The central empirical claims of 'improved trajectory accuracy and higher success rates' and 'ablations confirm the necessity' are presented without any quantitative metrics, baseline comparisons, error bars, statistical tests, or data-exclusion rules.

    Authors: We acknowledge that the current version summarizes improvements qualitatively in the abstract and experiments without specific numerical values or statistical details. In the revised manuscript we will add quantitative results including exact trajectory accuracy and success rate numbers, comparisons against all relevant baselines, error bars from multiple random seeds, statistical significance tests, and explicit data exclusion criteria.
    Revision: yes

  2. Referee: [Method / §3] The assertion that 'bidirectional constraint makes visual predictions executable and keeps decisions grounded in physically consistent futures' rests on joint optimization alone, without explicit physics-informed losses, consistency regularizers, or direct validation metrics (e.g., collision rates, object permanence, or momentum conservation in predicted scenes).

    Authors: We agree that diffusion models can generate visually plausible yet dynamically invalid sequences and that our current claim relies primarily on the bidirectional objectives. We will add direct validation metrics such as collision rates and object permanence scores on predicted rollouts in the revised experiments. We will also clarify that physical consistency is an emergent property of the joint training rather than the result of explicit physics losses, which we do not introduce.
    Revision: partial

  3. Referee: [Ablations] The statement that 'either branch removal degrading both prediction quality and policy reliability' is asserted without details on how the coupling was ablated (e.g., separate training schedules, loss weights, or architectural changes) or the resulting quantitative drops in the same metrics used for the main results.

    Authors: We will expand the ablations section to describe the exact ablation protocols, including separate training schedules, loss weighting schemes, and architectural modifications. We will report the resulting quantitative drops using the identical metrics as the main results, with error bars and statistical comparisons.
    Revision: yes

Circularity Check

0 steps flagged

No significant circularity; empirical claims rest on benchmarks rather than definitional reduction

Full rationale

The paper's core argument is that joint optimization of a diffusion video generator and vision-language policy via synchronized rollouts produces physically consistent futures. This is presented as an empirical outcome validated by trajectory accuracy, success rates, and ablations on coupling necessity. No equations, fitted parameters, or self-citations are shown to reduce the central claim to its inputs by construction. The bidirectional constraint is an architectural choice whose effectiveness is tested externally, not assumed tautologically. This is the expected low-circularity outcome for an empirical systems paper.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract is available; therefore the exact free parameters, axioms, and invented entities cannot be enumerated. The framework appears to rest on standard diffusion training losses and VLM components without introducing new postulated entities.

pith-pipeline@v0.9.0 · 5587 in / 1218 out tokens · 56718 ms · 2026-05-16T19:21:37.684589+00:00 · methodology



Forward citations

Cited by 7 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Dual-Anchoring: Addressing State Drift in Vision-Language Navigation

    cs.CV 2026-04 unverdicted novelty 7.0

    Dual-Anchoring Framework mitigates progress drift via structured instruction tokens and memory drift via landmark-centric retrospective prediction, yielding 15.2% success rate gain and 24.7% on long trajectories.

  2. PathPainter: Transferring the Generalization Ability of Image Generation Models to Embodied Navigation

    cs.RO 2026-05 unverdicted novelty 6.0

    PathPainter transfers image generation models to embodied navigation by generating traversability masks from BEV images and language instructions while using cross-view localization to reduce odometry drift.

  3. AsyncShield: A Plug-and-Play Edge Adapter for Asynchronous Cloud-based VLA Navigation

    cs.RO 2026-04 unverdicted novelty 6.0

    AsyncShield restores VLA geometric intent from latency via kinematic pose mapping and uses PPO-Lagrangian to balance tracking with LiDAR safety constraints in a plug-and-play module.

  4. What Limits Vision-and-Language Navigation?

    cs.RO 2026-05 unverdicted novelty 5.0

    StereoNav reaches new benchmark highs on R2R-CE and RxR-CE and improves real-robot reliability by supplying persistent target-location priors and stereo-derived geometry that stay stable under lighting changes and blur.

  5. Explore Like Humans: Autonomous Exploration with Online SG-Memo Construction for Embodied Agents

    cs.CV 2026-04 unverdicted novelty 5.0

    ABot-Explorer unifies online exploration and hierarchical semantic memory construction via VLM-distilled navigational affordances for improved embodied navigation efficiency.

  6. Dual-Anchoring: Addressing State Drift in Vision-Language Navigation

    cs.CV 2026-04 unverdicted novelty 5.0

    Dual-Anchoring adds explicit progress tokens and retrospective landmark verification to VLN agents, cutting state drift and lifting success rate 15.2% overall with 24.7% gains on long trajectories.

  7. Matrix-Game 3.0: Real-Time and Streaming Interactive World Model with Long-Horizon Memory

    cs.CV 2026-04 unverdicted novelty 4.0

    Matrix-Game 3.0 delivers 720p real-time video generation at 40 FPS with minute-scale memory consistency by combining residual self-correction training, camera-aware memory injection, and DMD-based autoregressive disti...
