AstraNav-World: World Model for Foresight Control and Consistency

Jintao Chen, Junjun Hu, Haochen Bai, Minghua Luo, Xinda Xue, Botao Ren, Chengyu Bai, Shichao Xie, Ziyi Chen, Fei Liu, Zedong Chu, Xiaolong Wu, Mu Xu, Shanghang Zhang

classification 💻 cs.CV

keywords astranav-world, embodied, foresight, model, navigation, real-world, visual, world

Embodied navigation in open, dynamic environments demands accurate foresight of how the world will evolve and how actions will unfold over time. We propose AstraNav-World, an end-to-end world model that jointly reasons about future visual states and action sequences within a unified probabilistic framework. Our framework integrates a diffusion-based video generator with a vision-language policy, enabling synchronized rollouts where predicted scenes and planned actions are updated simultaneously. Training optimizes two complementary objectives: generating action-conditioned multi-step visual predictions and deriving trajectories conditioned on those predicted visuals. This bidirectional constraint makes visual predictions executable and keeps decisions grounded in physically consistent, task-relevant futures, mitigating cumulative errors common in decoupled "envision-then-plan" pipelines. Experiments across diverse embodied navigation benchmarks show improved trajectory accuracy and higher success rates. Ablations confirm the necessity of tight vision-action coupling and unified training: removing either branch degrades both prediction quality and policy reliability. In real-world testing, AstraNav-World demonstrated exceptional zero-shot capabilities, adapting to previously unseen scenarios without any real-world fine-tuning. These results suggest that AstraNav-World captures transferable spatial understanding and planning-relevant navigation dynamics, rather than merely overfitting to a simulation-specific data distribution. Overall, by unifying foresight vision and control within a single generative model, we move closer to reliable, interpretable, and general-purpose embodied agents that operate robustly in open-ended real-world settings.

Forward citations

Cited by 6 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

PathPainter: Transferring the Generalization Ability of Image Generation Models to Embodied Navigation
cs.RO 2026-05 unverdicted novelty 6.0

PathPainter transfers image generation models to embodied navigation by generating traversability masks from BEV images and language instructions while using cross-view localization to reduce odometry drift.
AsyncShield: A Plug-and-Play Edge Adapter for Asynchronous Cloud-based VLA Navigation
cs.RO 2026-04 unverdicted novelty 6.0

AsyncShield restores VLA geometric intent from latency via kinematic pose mapping and uses PPO-Lagrangian to balance tracking with LiDAR safety constraints in a plug-and-play module.
What Limits Vision-and-Language Navigation?
cs.RO 2026-05 unverdicted novelty 5.0

StereoNav reaches new benchmark highs on R2R-CE and RxR-CE and improves real-robot reliability by supplying persistent target-location priors and stereo-derived geometry that stay stable under lighting changes and blur.
Explore Like Humans: Autonomous Exploration with Online SG-Memo Construction for Embodied Agents
cs.CV 2026-04 unverdicted novelty 5.0

ABot-Explorer unifies online exploration and hierarchical semantic memory construction via VLM-distilled navigational affordances for improved embodied navigation efficiency.
Dual-Anchoring: Addressing State Drift in Vision-Language Navigation
cs.CV 2026-04 unverdicted novelty 5.0

Dual-Anchoring adds explicit progress tokens and retrospective landmark verification to VLN agents, cutting state drift and lifting success rate 15.2% overall with 24.7% gains on long trajectories.
Matrix-Game 3.0: Real-Time and Streaming Interactive World Model with Long-Horizon Memory
cs.CV 2026-04 unverdicted novelty 4.0

Matrix-Game 3.0 delivers 720p real-time video generation at 40 FPS with minute-scale memory consistency by combining residual self-correction training, camera-aware memory injection, and DMD-based autoregressive disti...