
arxiv: 2606.29908 · v1 · pith:XTCZBNN4new · submitted 2026-06-29 · 💻 cs.RO · cs.AI

Pondering the Way: Spatial-perceiving World Action Model for Embodied Navigation

Pith reviewed 2026-06-30 05:53 UTC · model grok-4.3

classification 💻 cs.RO cs.AI
keywords embodied navigation · world models · visual navigation · joint observation-action generation · trajectory planning · zero-shot generalization · spatial perception · action refinement

The pith

SWAM jointly generates RGB-D sequences and action trajectories from start and goal images in a single inference pass for embodied navigation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces SWAM as a shift away from verification-centric world models, which separate goal intent from trajectory planning. Instead, it uses one forward pass to produce both the intermediate visual observations and the matching actions. Training incorporates depth pseudo-labels to build spatial awareness, while only RGB is needed at test time. A refinement module and a scale regularization loss are added to keep predicted motion aligned with the generated visuals. The reported results show gains in success rate, trajectory accuracy, inference speed, and generalization to new scenes over prior two-stage methods.

Core claim

SWAM performs single-pass joint generation of intermediate RGB-D sequences and corresponding action trajectories given only start and goal RGB observations. This task-centric design replaces candidate sampling and verification loops with direct synthesis that enforces goal consistency and spatial feasibility. Depth pseudo-labels are used only during training; inference remains monocular. A visual-guided action refinement module together with trajectory-scale regularization further aligns motion predictions with visual cues and stabilizes outputs across distance scales.
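The contrast between candidate sampling with verification and single-pass joint generation can be illustrated with a toy sketch. Everything here is a hypothetical stand-in, not the paper's model: `toy_world_model_score` replaces a learned world model, `single_pass_plan` returns a trivial straight-line trajectory with placeholder frame names, and `GOAL` is a 2D point standing in for a goal image. The only point is the difference in how many model calls each paradigm needs.

```python
import random

GOAL = (3.0, 0.0)   # assumed toy goal position standing in for a goal image
HORIZON = 6         # assumed number of planar displacement steps

def end_point(actions):
    """Sum planar displacements (dx, dy) to get the final position."""
    return (sum(a[0] for a in actions), sum(a[1] for a in actions))

def toy_world_model_score(actions):
    """Stand-in verifier: negative distance between the rollout end point and the goal."""
    x, y = end_point(actions)
    return -((x - GOAL[0]) ** 2 + (y - GOAL[1]) ** 2) ** 0.5

def two_stage_plan(n_candidates, rng):
    """Verification-centric loop: sample candidates, score each one, keep the best."""
    best, best_score, calls = None, float("-inf"), 0
    for _ in range(n_candidates):
        cand = [(rng.uniform(0, 1), rng.uniform(-0.5, 0.5)) for _ in range(HORIZON)]
        score = toy_world_model_score(cand)
        calls += 1  # one world-model evaluation per candidate
        if score > best_score:
            best, best_score = cand, score
    return best, calls

def single_pass_plan():
    """Joint generation: one call returns both frames and actions (toy straight line)."""
    step = (GOAL[0] / HORIZON, GOAL[1] / HORIZON)
    actions = [step] * HORIZON
    frames = [f"rgbd_frame_{t}" for t in range(HORIZON)]  # placeholders, not images
    return actions, frames, 1

rng = random.Random(0)
two_stage_actions, two_stage_calls = two_stage_plan(16, rng)
joint_actions, joint_frames, joint_calls = single_pass_plan()
print(two_stage_calls, joint_calls)
```

In this sketch the two-stage planner's cost grows with the number of candidates, while the joint generator's cost is fixed at one call; whether the single call is cheaper in practice depends on the size of the real model, which this toy does not capture.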

What carries the argument

The Spatial-perceiving World Action Model (SWAM), which performs single-pass joint observation-action generation, carries the argument: it is the mechanism meant to enforce goal-consistent and spatially feasible trajectories.

If this is right

  • Removes dependence on candidate sampling and separate verification steps, lowering computational cost.
  • Produces trajectories whose visual predictions stay aligned with executed actions by construction.
  • Internalizes spatial priors from depth labels so that monocular input suffices at inference.
  • Yields higher success rates and more accurate trajectories than two-stage planners.
  • Maintains performance when transferred zero-shot to environments never seen in training.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The single-pass structure may limit error accumulation that occurs when planning and verification are chained across many steps.
  • Monocular inference at test time could simplify hardware requirements for robots that lack depth sensors.
  • The same joint-generation pattern could be tested on other control problems where visual prediction and action choice currently drift apart.
  • If the regularization terms prove general, similar scale-aware losses might stabilize long-horizon predictions in related sequence models.
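To make the idea of a scale-aware loss concrete, here is a minimal sketch of one possible form. The paper's actual trajectory-scale regularization loss is not given on this page, so the log-ratio penalty below, the function names, and the `eps` constant are all assumptions chosen for illustration: the penalty is zero when predicted and reference path lengths match and grows equally for overshoot and undershoot by the same factor.

```python
import math

def path_length(actions):
    """Total length of a trajectory given as planar displacements (dx, dy)."""
    return sum(math.hypot(dx, dy) for dx, dy in actions)

def scale_regularizer(pred_actions, gt_actions, eps=1e-6):
    """Hypothetical scale penalty: |log(L_pred / L_gt)|, with eps guarding zero lengths."""
    ratio = (path_length(pred_actions) + eps) / (path_length(gt_actions) + eps)
    return abs(math.log(ratio))

gt = [(1.0, 0.0)] * 3            # reference path, 3 m long
undershoot = [(0.5, 0.0)] * 3    # half the reference length
overshoot = [(2.0, 0.0)] * 3     # twice the reference length

loss_same = scale_regularizer(gt, gt)
loss_under = scale_regularizer(undershoot, gt)
loss_over = scale_regularizer(overshoot, gt)
print(loss_same, loss_under, loss_over)
```

A ratio-based penalty treats a 1.5 m path and a 10 m path on the same relative footing, which is one way a loss could remain stable across distance scales; an absolute-difference penalty would instead be dominated by long trajectories.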

Load-bearing premise

Jointly predicting future RGB-D images and actions in one model will automatically keep the actions feasible inside the predicted visual space without introducing new mismatches.

What would settle it

Execute the generated action sequence in simulation while feeding the predicted RGB-D frames as observations; if the agent deviates from the intended goal or the observed depths contradict the predicted sequence, the joint-consistency claim fails.
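The check described above can be sketched as a small consistency test. This is an illustration under stated assumptions, not the paper's evaluation protocol: actions are treated as planar displacements, depth frames are flat lists of values, the observed depths would come from a simulator in a real test (they are hard-coded here), and the tolerances `pos_tol` and `depth_tol` are arbitrary.

```python
import math

def rollout(start_xy, actions):
    """Dead-reckon a path by accumulating planar displacements (dx, dy)."""
    x, y = start_xy
    path = [(x, y)]
    for dx, dy in actions:
        x, y = x + dx, y + dy
        path.append((x, y))
    return path

def goal_error(path, goal_xy):
    """Distance between the final rollout position and the goal position."""
    return math.hypot(path[-1][0] - goal_xy[0], path[-1][1] - goal_xy[1])

def depth_abs_rel(pred, obs):
    """Mean absolute relative depth error over pixels with a valid observation."""
    pairs = [(p, o) for p, o in zip(pred, obs) if o > 0]
    return sum(abs(p - o) / o for p, o in pairs) / len(pairs)

def joint_consistency_ok(actions, goal_xy, pred_depths, obs_depths, pos_tol, depth_tol):
    """Pass only if the agent ends near the goal and every frame's depth error is small."""
    path = rollout((0.0, 0.0), actions)
    worst = max(depth_abs_rel(p, o) for p, o in zip(pred_depths, obs_depths))
    return goal_error(path, goal_xy) <= pos_tol and worst <= depth_tol

actions = [(0.5, 0.0)] * 4
pred_depths = [[2.0, 4.0], [1.5, 3.0]]   # model-predicted depth frames (toy values)
obs_depths = [[2.0, 5.0], [1.5, 3.0]]    # depths observed during execution (toy values)

loose = joint_consistency_ok(actions, (2.0, 0.1), pred_depths, obs_depths, 0.25, 0.15)
strict = joint_consistency_ok(actions, (2.0, 0.1), pred_depths, obs_depths, 0.25, 0.05)
print(loose, strict)
```

With these toy values the worst per-frame relative depth error is 0.1, so the check passes at a depth tolerance of 0.15 and fails at 0.05; a real test would need tolerances justified by sensor and simulator noise.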

Figures

Figures reproduced from arXiv: 2606.29908 by Bing Wang, Daqi Liu, Fangzhen Li, Guang Chen, Haiguang Wang, Haiyang Sun, Hangjun Ye, Hong Chen, Hongwei Xie, Longfei Yan, Tianhao Lu, Yihua Tan, Zehan Zhang.

Figure 1: Comparison of different paradigms for visual navigation. Left: Nomad predicts trajectories from historical RGB frames and a goal image. Middle: NWM generates future RGB videos conditioned on trajectories. Right (Ours): Our model jointly predicts trajectories and RGBD paths from the current RGB observation and the goal image in one forward pass, enabling efficient planning.
Figure 2: Overview of SWAM. We extend the pretrained Diffusion Transformer (DiT) of CogVideoX [46] by conditioning it on start and goal frame latents, and fine-tune it to jointly produce intermediate RGB-D frame latents and associated action tokens. DepthAnything V3 [21] is used to predict pseudo-depth maps for the start and goal frames. The Visual-Guided Action Refinement (VGAR) module further refines the predicted…
Figure 3: Success@(τ) curves across thresholds on RECON. SWAM achieves substantially higher success rates, especially at strict thresholds (τ = 0.25 and τ = 0.5).
Figure 4: Robustness to path length. SWAM maintains accurate trajectory scaling across distances, while NWM exhibits obvious trajectory scale errors. … indicates that SWAM enhances final-positional precision, which closely aligns with navigation planning. Video Generation Quality. Tab. 2 summarizes video generation metrics. SWAM achieves state-of-the-art results across all datasets. We attribute the improved visual q…
Figure 5: Qualitative comparison between SWAM and NWM. SWAM generates trajectories closer to ground truth with consistent video generation, while NWM struggles in such challenging scenarios. …tory scaling across distances (e.g., ∼10 m and 1.5 m in two examples), whereas NWM exhibits scale drift that grows with the prediction horizon, indicating that SWAM effectively mitigates scale accumulation errors over varying d…
Figure 6: Qualitative results of typical failure scenarios. …ing, leads to erroneous avoidance behaviors and accumulated trajectory drift. Another limitation arises in sharp-turn scenarios, where the current planar displacement based action representation fails to capture the rapid heading adjustments required for high-curvature paths. Because the framework does not explicitly encode orientation dynamics, the agen…
Figure 1: Visualization of zero-shot generalization results on the HuRoN dataset. … to measure the perceptual similarity and structural consistency of the generated depth maps. As shown in…
Figure 2: More visualization results of SWAM.
Figure 3: Visualization of long-horizon prediction results of SWAM (64 frames). … temporal consistency. Note that a long sequence does not necessarily correspond to a long physical travel distance. This difference is related to factors such as the agent’s movement speed and the data collection efficiency. SWAM, as a high-level planner rather than a local executor, is more concerned with enabling path planning over var…
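The Success@(τ) curves mentioned in the caption of the first Figure 3 can be illustrated with a short sketch. The exact definition used in the paper is not given on this page; the version below assumes an episode counts as a success when its final position error is at most τ, and the error values are made up for the example.

```python
def success_at(final_errors, tau):
    """Fraction of episodes whose final position error is within the threshold tau."""
    return sum(e <= tau for e in final_errors) / len(final_errors)

# Hypothetical per-episode final position errors (units assumed to match tau).
errors = [0.1, 0.3, 0.6, 1.2]

# Sweep thresholds to trace out a success curve.
curve = {tau: success_at(errors, tau) for tau in (0.25, 0.5, 1.0)}
print(curve)
```

Because a looser threshold can only admit more episodes, the curve is non-decreasing in τ, which is why differences at strict thresholds such as τ = 0.25 are the most informative about final-position precision.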
Original abstract

Existing world model-based planners for visual navigation typically follow a verification-centric paradigm, decoupling goal intent from trajectory synthesis. This approach suffers from candidate dependence, heavy computational overhead, and inconsistencies between sampled actions and predicted visuals. To address these issues, we propose SWAM (Spatial-perceiving World Action Model), a task-centric joint observation-action generation framework. Given start and goal RGB observations, SWAM performs single-pass inference to simultaneously generate intermediate RGB-D sequences and corresponding action trajectories, promoting goal-consistent trajectory generation and improved spatial feasibility. While SWAM leverages depth pseudo-labels during training to internalize spatial priors, it requires only monocular RGB input at inference time. We further introduce a visual-guided action refinement module and a trajectory-scale regularization loss to enforce fine-grained alignment between motion and visual cues while stabilizing predictions across varying distances. Extensive experiments show that SWAM significantly outperforms state-of-the-art two-stage planners in success rate, trajectory accuracy, and inference efficiency, while demonstrating robust zero-shot generalization to unseen environments.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The manuscript proposes SWAM (Spatial-perceiving World Action Model), a joint observation-action generation framework for embodied visual navigation. Unlike verification-centric two-stage planners that decouple goal intent from trajectory synthesis, SWAM performs single-pass inference to generate intermediate RGB-D sequences and corresponding action trajectories from start and goal RGB observations. Depth pseudo-labels are used only during training to internalize spatial priors, while inference uses monocular RGB. Additional components include a visual-guided action refinement module and a trajectory-scale regularization loss. The central claim is that this architecture yields higher success rates, better trajectory accuracy, improved inference efficiency, and stronger zero-shot generalization to unseen environments compared to prior two-stage methods.

Significance. If the performance claims hold under rigorous evaluation, the work could be significant for embodied navigation and world-model planning. The shift to single-pass joint generation directly targets documented inconsistencies between predicted visuals and executed actions, while the training-time use of depth pseudo-labels followed by RGB-only inference is a practical design that could reduce overhead. Demonstrating robust generalization without explicit depth at test time would be a useful contribution if supported by appropriate controls.

major comments (2)
  1. [Abstract] The central claims of significant outperformance in success rate, trajectory accuracy, and inference efficiency, plus robust zero-shot generalization, are asserted without any quantitative results, baselines, error bars, dataset descriptions, or ablation evidence appearing in the provided manuscript text. This absence prevents evaluation of whether the joint-generation approach actually delivers the claimed consistency and efficiency gains.
  2. [Abstract] The manuscript text supplies no equations, architectural diagrams, loss formulations, or training details that would allow assessment of how the visual-guided refinement module and trajectory-scale regularization loss enforce alignment between motion and visual cues. Without these, it is impossible to verify that the single-pass design avoids introducing new mismatches between predicted visuals and actions.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback. We address the major comments point by point below and agree that revisions are needed to strengthen the abstract and ensure technical details are verifiable.

Point-by-point responses
  1. Referee: [Abstract] The central claims of significant outperformance in success rate, trajectory accuracy, and inference efficiency, plus robust zero-shot generalization, are asserted without any quantitative results, baselines, error bars, dataset descriptions, or ablation evidence appearing in the provided manuscript text. This absence prevents evaluation of whether the joint-generation approach actually delivers the claimed consistency and efficiency gains.

    Authors: We agree that the abstract, as currently written, asserts performance claims without quantitative support or dataset details. The full manuscript contains these results in the Experiments section. We will revise the abstract to include key quantitative highlights (e.g., success rate deltas, inference speedups, and dataset names) to allow immediate evaluation of the claims. revision: yes

  2. Referee: [Abstract] The manuscript text supplies no equations, architectural diagrams, loss formulations, or training details that would allow assessment of how the visual-guided refinement module and trajectory-scale regularization loss enforce alignment between motion and visual cues. Without these, it is impossible to verify that the single-pass design avoids introducing new mismatches between predicted visuals and actions.

    Authors: We acknowledge that the provided manuscript text lacks equations, diagrams, loss formulations, and training details for the visual-guided action refinement module and trajectory-scale regularization loss. We will revise the manuscript to include these elements (equations for the losses, an architecture diagram, and explicit description of how they enforce visual-action alignment) so that the single-pass design can be properly assessed. revision: yes

Circularity Check

0 steps flagged

No significant circularity in derivation chain

Full rationale

The provided abstract and description present SWAM as an empirical architecture: a single-pass joint RGB-D and action generator trained with depth pseudo-labels, refinement module, and regularization loss. No equations, parameter-fitting steps, uniqueness theorems, or self-citations are referenced that would reduce claimed performance gains to quantities defined by the inputs themselves. Experimental claims of improved success rate and zero-shot generalization rest on external benchmarks rather than any internal definitional loop, rendering the method self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The abstract provides no equations, training details, or explicit assumptions, so free parameters, axioms, and invented entities cannot be enumerated.



Reference graph

Works this paper leans on

56 extracted references · 18 canonical work pages · 14 internal anchors

  [1] Allen, J.F., Koomen, J.A.: Planning using a temporal world model. In: Proceedings of the Eighth international joint conference on Artificial intelligence-Volume 2. pp. 741–747 (1983)
  [2] Bar, A., Zhou, G., Tran, D., Darrell, T., LeCun, Y.: Navigation world models. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 15791–15801 (2025)
  [3] Black, K., Brown, N., Driess, D., Esmail, A., Equi, M., Finn, C., Fusai, N., Groom, L., Hausman, K., Ichter, B., et al.: π0: A vision-language-action flow model for general robot control. arXiv preprint arXiv:2410.24164 (2024)
  [4] Bonin-Font, F., Ortiz, A., Oliver, G.: Visual navigation for mobile robots: A survey. Journal of intelligent and robotic systems 53(3), 263–296 (2008)
  [5] Cen, J., Yu, C., Yuan, H., Jiang, Y., Huang, S., Guo, J., Li, X., Song, Y., Luo, H., Wang, F., et al.: Worldvla: Towards autoregressive action world model. arXiv preprint arXiv:2506.21539 (2025)
  [6] Chi, C., Feng, S., Du, Y., Xu, Z., Cousineau, E., Burchfiel, B., Tedrake, R.: Diffusion policy: Visuomotor policy learning via action diffusion. In: Robotics: Science and Systems (RSS) (2023)
  [7] Ebert, F., Finn, C., Dasari, S., Xie, A., Lee, A., Levine, S.: Visual foresight: Model-based deep reinforcement learning for vision-based robotic control. arXiv preprint arXiv:1812.00568 (2018)
  [8] Finn, C., Levine, S.: Deep visual foresight for planning robot motion. In: 2017 IEEE international conference on robotics and automation (ICRA). pp. 2786–2793. IEEE (2017)
  [9] Gao, S., Liang, W., Zheng, K., Malik, A., Ye, S., Yu, S., Tseng, W.C., Dong, Y., Mo, K., Lin, C.H., et al.: Dreamdojo: A generalist robot world model from large-scale human videos. arXiv preprint arXiv:2602.06949 (2026)
  [10] Guo, Y., Hu, Y., Zhang, J., Wang, Y.J., Chen, X., Lu, C., Chen, J.: Prediction with action: Visual policy learning via joint denoising process. Advances in Neural Information Processing Systems 37, 112386–112410 (2024)
  [11] Gupta, S., Davidson, J., Levine, S., Sukthankar, R., Malik, J.: Cognitive mapping and planning for visual navigation. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 2616–2625 (2017)
  [12] Ha, D., Schmidhuber, J.: World models. arXiv preprint arXiv:1803.10122 2(3), 440 (2018)
  [13] Hafner, D., Lillicrap, T., Ba, J., Norouzi, M.: Dream to control: Learning behaviors by latent imagination. In: International Conference on Learning Representations (ICLR) (2020)
  [14] Hao, S., Gu, Y., Ma, H., Hong, J., Wang, Z., Wang, D., Hu, Z.: Reasoning with language model is planning with world model. In: Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing. pp. 8154–8173 (2023)
  [15] Hirose, N., Shah, D., Sridhar, A., Levine, S.: Sacson: Scalable autonomous control for social navigation. IEEE Robotics and Automation Letters 9(1), 49–56 (2023)
  [16] Hirose, N., Xia, F., Martín-Martín, R., Sadeghian, A., Savarese, S.: Deep visual mpc-policy learning for navigation. IEEE Robotics and Automation Letters 4(4), 3184–3191 (2019)
  [17] Ho, J., Jain, A., Abbeel, P.: Denoising diffusion probabilistic models. Advances in neural information processing systems 33, 6840–6851 (2020)
  [18] Karnan, H., Nair, A., Xiao, X., Warnell, G., Pirk, S., Toshev, A., Hart, J., Biswas, J., Stone, P.: Socially compliant navigation dataset (scand): A large-scale dataset of demonstrations for social navigation. IEEE Robotics and Automation Letters 7(4), 11807–11814 (2022)
  [19] Kim, M.J., Pertsch, K., Karamcheti, S., Xiao, T., Balakrishna, A., Nair, S., Rafailov, R., Foster, E., Lam, G., Sanketi, P., et al.: Openvla: An open-source vision-language-action model. arXiv preprint arXiv:2406.09246 (2024)
  [20] Li, X., He, X., Zhang, L., Wu, M., Li, X., Liu, Y.: A comprehensive survey on world models for embodied ai. arXiv preprint arXiv:2510.16732 (2025)
  [21] Lin, H., Chen, S., Liew, J., Chen, D.Y., Li, Z., Shi, G., Feng, J., Kang, B.: Depth anything 3: Recovering the visual space from any views. arXiv preprint arXiv:2511.10647 (2025)
  [22] Lipman, Y., Chen, R.T., Ben-Hamu, H., Nickel, M., Le, M.: Flow matching for generative modeling. In: The Eleventh International Conference on Learning Representations
  [23] Liu, Y., Zhang, K., Li, Y., Yan, Z., Gao, C., Chen, R., Yuan, Z., Huang, Y., Sun, H., Gao, J., et al.: Sora: A review on background, technology, limitations, and opportunities of large vision models. arXiv preprint arXiv:2402.17177 (2024)
  [24] Mirowski, P., Pascanu, R., Viola, F., Soyer, H., Ballard, A., Banino, A., Denil, M., Goroshin, R., Sifre, L., Kavukcuoglu, K., et al.: Learning to navigate in complex environments. In: International Conference on Learning Representations (2017)
  [25] Parisotto, E., Salakhutdinov, R.: Neural map: Structured memory for deep reinforcement learning. In: International Conference on Learning Representations (2018)
  [26] Peebles, W., Xie, S.: Scalable diffusion models with transformers. In: Proceedings of the IEEE/CVF international conference on computer vision. pp. 4195–4205 (2023)
  [27] Savinov, N., Dosovitskiy, A., Koltun, V.: Semi-parametric topological memory for navigation. In: International Conference on Learning Representations (2018)
  [28] Shah, D., Equi, M.R., Osiński, B., Xia, F., Ichter, B., Levine, S.: Navigation with large language models: Semantic guesswork as a heuristic for planning. In: Conference on Robot Learning. pp. 2683–2699. PMLR (2023)
  [29] Shah, D., Eysenbach, B., Kahn, G., Rhinehart, N., Levine, S.: Ving: Learning open-world navigation with visual goals. In: 2021 IEEE International Conference on Robotics and Automation (ICRA). pp. 13215–13222. IEEE (2021)
  [30] Shah, D., Eysenbach, B., Rhinehart, N., Levine, S.: Rapid exploration for open-world navigation with latent goal models. In: Conference on Robot Learning. pp. 674–684. PMLR (2022)
  [31] Shah, D., Sridhar, A., Bhorkar, A., Hirose, N., Levine, S.: Gnm: A general navigation model to drive any robot. In: 2023 IEEE International Conference on Robotics and Automation (ICRA). pp. 7226–7233. IEEE (2023)
  [32] Shah, D., Sridhar, A., Dashora, N., Stachowicz, K., Black, K., Hirose, N., Levine, S.: Vint: A foundation model for visual navigation. In: Conference on Robot Learning. pp. 711–733. PMLR (2023)
  [33] Shen, Y., Wei, F., Du, Z., Liang, Y., Lu, Y., Yang, J., Zheng, N., Guo, B.: Videovla: Video generators can be generalizable robot manipulators. In: The Thirty-ninth Annual Conference on Neural Information Processing Systems (NeurIPS 2025) (2025)
  [34] Song, J., Meng, C., Ermon, S.: Denoising diffusion implicit models. In: International Conference on Learning Representations
  [35] Song, Y., Sohl-Dickstein, J., Kingma, D.P., Kumar, A., Ermon, S., Poole, B.: Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations
  [36] Sridhar, A., Shah, D., Glossop, C., Levine, S.: Nomad: Goal masked diffusion policies for navigation and exploration. In: 2024 IEEE International Conference on Robotics and Automation (ICRA). pp. 63–70. IEEE (2024)
  [37] Team, O.M., Ghosh, D., Walke, H., Pertsch, K., Black, K., Mees, O., Dasari, S., Hejna, J., Kreiman, T., Xu, C., et al.: Octo: An open-source generalist robot policy. arXiv preprint arXiv:2405.12213 (2024)
  [38] Triest, S., Sivaprakasam, M., Wang, S.J., Wang, W., Johnson, A.M., Scherer, S.: Tartandrive: A large-scale dataset for learning off-road dynamics models. In: 2022 International Conference on Robotics and Automation (ICRA). pp. 2546–2552. IEEE (2022)
  [39] Wan, T., Wang, A., Ai, B., Wen, B., Mao, C., Xie, C.W., Chen, D., Yu, F., Zhao, H., Yang, J., et al.: Wan: Open and advanced large-scale video generative models. arXiv preprint arXiv:2503.20314 (2025)
  [40] Wang, H., Liu, D., Xie, H., Liu, H., Ma, E., Yu, K., Wang, L., Wang, B.: Mila: Multi-view intensive-fidelity long-term video generation world model for autonomous driving. arXiv preprint arXiv:2503.15875 (2025)
  [41] Wu, Y., Karunratanakul, K., Luo, Z., Tang, S.: Uniphys: Unified planner and controller with diffusion for flexible physics-based character control. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 13214–13224 (2025)
  [42] Xiong, Z., Ye, X., Yaman, B., Cheng, S., Lu, Y., Luo, J., Jacobs, N., Ren, L.: UniDrive-WM: Unified understanding, planning and generation world model for autonomous driving. arXiv preprint arXiv:2601.04453 (2026)
  [43] Yang, J., Chitta, K., Gao, S., Chen, L., Shao, Y., Jia, X., Li, H., Geiger, A., Yue, X., Chen, L.: Resim: Reliable world simulation for autonomous driving. In: Belgrave, D., Zhang, C., Lin, H., Pascanu, R., Koniusz, P., Ghassemi, M., Chen, N. (eds.) Advances in Neural Information Processing Systems. vol. 38, pp. 167710–167741 (2025)
  [44] Yang, L., Bai, Y., Eskandar, G., Shen, F., Altillawi, M., Chen, D., Liu, Z., Valada, A.: Covar: Co-generation of video and action for robotic manipulation via multi-modal diffusion. arXiv preprint arXiv:2512.16023 (2025)
  [45] Yang, Y., Liu, J., Zhang, Z., Zhou, S., Tan, R., Yang, J., Du, Y., Gan, C.: Mindjourney: Test-time scaling with world models for spatial reasoning. In: The Thirty-ninth Annual Conference on Neural Information Processing Systems (2025)
  [46] Yang, Z., Teng, J., Zheng, W., Ding, M., Huang, S., Xu, J., Yang, Y., Hong, W., Zhang, X., Feng, G., Yin, D., Yuxuan.Zhang, Wang, W., Cheng, Y., Xu, B., Gu, X., Dong, Y., Tang, J.: Cogvideox: Text-to-video diffusion models with an expert transformer. In: The Thirteenth International Conference on Learning Representations (2025)
  [47] Ye, S., Ge, Y., Zheng, K., Gao, S., Yu, S., Kurian, G., Indupuru, S., Tan, Y.L., Zhu, C., Xiang, J., et al.: World action models are zero-shot policies. arXiv preprint arXiv:2602.15922 (2026)
  [48] Yin, T., Mei, Z., Sun, T., Zha, L., Zhou, E., Bao, J., Yamane, M., Sho, O., Majumdar, A.: Womap: World models for embodied open-vocabulary object localization. In: RSS 2025 Workshop: Mobile Manipulation: Emerging Opportunities & Contemporary Challenges (2025)
  [49] Zhang, J., Jiang, M., Dai, N., Lu, T., Uzunoglu, A., Zhang, S., Wei, Y., Wang, J., Patel, V.M., Liang, P.P., et al.: World-in-world: World models in a closed-loop world. arXiv preprint arXiv:2510.18135 (2025)
  [50] Zhang, K., Tang, Z., Hu, X., Pan, X., Guo, X., Liu, Y., Huang, J., Yuan, L., Zhang, Q., Long, X.X., et al.: Epona: Autoregressive diffusion world model for autonomous driving. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 27220–27230 (2025)
  [51] Zhang, T., Hu, X., Xiao, J., Zhang, G.: A survey of visual navigation: From geometry to embodied ai. Engineering Applications of Artificial Intelligence 114, 105036 (2022)
  [52] Zheng, Z., Peng, X., Yang, T., Shen, C., Li, S., Liu, H., Zhou, Y., Li, T., You, Y.: Open-sora: Democratizing efficient video production for all. arXiv preprint arXiv:2412.20404 (2024)
  [53] Zhou, G., Pan, H., LeCun, Y., Pinto, L.: Dino-wm: World models on pre-trained visual features enable zero-shot planning. arXiv preprint arXiv:2411.04983 (2024)
  [54] Zhou, S., Du, Y., Yang, Y., Han, L., Chen, P., Yeung, D.Y., Gan, C.: Learning 3d persistent embodied world models. In: The Thirty-ninth Annual Conference on Neural Information Processing Systems (2025)
  [55] Zhu, H., Wang, Y., Zhou, J., Chang, W., Zhou, Y., Li, Z., Chen, J., Shen, C., Pang, J., He, T.: Aether: Geometric-aware unified world modeling. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 8535–8546 (2025)
  [56] Zhu, Y., Mottaghi, R., Kolve, E., Lim, J.J., Gupta, A., Fei-Fei, L., Farhadi, A.: Target-driven visual navigation in indoor scenes using deep reinforcement learning. In: 2017 IEEE international conference on robotics and automation (ICRA). pp. 3357–3364. IEEE (2017)