arxiv: 2605.24578 · v1 · pith:VSXCWOS7new · submitted 2026-05-23 · 💻 cs.CV

World Models as Group Actions

Pith reviewed 2026-06-30 13:15 UTC · model grok-4.3

classification 💻 cs.CV
keywords video world models · group actions · action consistency · latent regularization · SE(2) navigation · world model dynamics · GAC metric · GAR metric

The pith

Video world models realize group actions on latent states when identity, inverse, and composition rules are enforced through regularization.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper claims that action faithfulness in video world models is best understood as the model realizing a group action on the state space, such as the special Euclidean group for navigation tasks. This provides a clear criterion beyond visual realism: the model must respect how actions compose, invert, and include an identity element. The authors operationalize this by adding latent-space regularization terms that enforce these properties using only synthesized supervision, without needing extra real data. Experiments on state-of-the-art models show gains in the proposed Group-Action Consistency and Group-Action Robustness metrics while keeping perceptual quality intact. A sympathetic reader would care because this turns world models from video generators into reliable simulators for planning and control in embodied settings.

Core claim

The central claim is that action-conditioned world modeling amounts to realizing a group action on the state space, and that enforcing the group properties of identity, inverse, and composition consistency through latent regularization with synthesized data improves structural correctness of the dynamics.

What carries the argument

Realizing group actions on the latent state space via consistency regularization on identity, inverse, and composition properties using synthesized supervision.
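
For reference, the three properties named above are the standard conditions for a group action, plus one consequence. The block below is a sketch in generic notation: the symbols $G$ (action group), $\mathcal{S}$ (state space), and $\Phi$ (the action map) are chosen here for illustration and are not necessarily the paper's.

```latex
\Phi : G \times \mathcal{S} \to \mathcal{S}, \qquad
\underbrace{\Phi(e, s) = s}_{\text{identity}}, \qquad
\underbrace{\Phi(g, \Phi(h, s)) = \Phi(gh, s)}_{\text{composition}}
\quad \Longrightarrow \quad
\underbrace{\Phi(g^{-1}, \Phi(g, s)) = s}_{\text{inverse}} .
```

The inverse property follows from the other two, since $\Phi(g^{-1}, \Phi(g, s)) = \Phi(g^{-1} g, s) = \Phi(e, s) = s$.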

If this is right

  • Action sequences will compose correctly in the model's predictions.
  • Inverse actions will reliably return the state to its prior condition.
  • Identity actions will leave the predicted state unchanged.
  • Rollouts will exhibit greater stability over long horizons under the group constraints.
  • These improvements occur without any loss in visual fidelity of the generated videos.
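
For the SE(2) navigation case, the first three properties can be checked on the group itself before asking whether a model respects them. The following is a minimal sketch, assuming planar actions are written as (dx, dy, dtheta) tuples in the agent's own frame. The helper names `wrap`, `compose`, `inverse`, and `close` are hypothetical and not taken from the paper.

```python
import math

IDENTITY = (0.0, 0.0, 0.0)  # the "do nothing" action

def wrap(theta):
    """Map an angle into [-pi, pi] so equal headings compare equal."""
    return math.atan2(math.sin(theta), math.cos(theta))

def compose(g, h):
    """SE(2) product: perform g, then h expressed in the frame reached by g."""
    gx, gy, gt = g
    hx, hy, ht = h
    c, s = math.cos(gt), math.sin(gt)
    return (gx + c * hx - s * hy, gy + s * hx + c * hy, wrap(gt + ht))

def inverse(g):
    """The action that undoes g, so compose(g, inverse(g)) is the identity."""
    x, y, t = g
    c, s = math.cos(t), math.sin(t)
    return (-(c * x + s * y), s * x - c * y, wrap(-t))

def close(g, h, tol=1e-9):
    """Approximate equality of two poses, comparing headings modulo 2*pi."""
    return (abs(g[0] - h[0]) < tol and abs(g[1] - h[1]) < tol
            and abs(wrap(g[2] - h[2])) < tol)

forward = (1.0, 0.0, 0.0)            # move one unit ahead
turn_left = (0.0, 0.0, math.pi / 2)  # rotate 90 degrees in place
other = (0.5, 0.2, -0.3)             # an arbitrary third action

assert close(compose(forward, IDENTITY), forward)             # identity
assert close(compose(forward, inverse(forward)), IDENTITY)    # inverse
assert close(compose(turn_left, inverse(turn_left)), IDENTITY)
# Composition is associative but not commutative: the order of actions matters.
assert close(compose(compose(forward, turn_left), other),
             compose(forward, compose(turn_left, other)))
assert not close(compose(forward, turn_left), compose(turn_left, forward))
```

The last assertion is the reason composition needs its own check: moving then turning ends at a different pose than turning then moving, so a model cannot satisfy the constraint by treating actions as interchangeable.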

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Applying the same regularization to other domains like robotic manipulation could enforce their respective group structures such as SE(3).
  • Future work might derive the group structure automatically from data rather than assuming it a priori.
  • The metrics GAC and GAR could serve as training objectives in addition to evaluation.

Load-bearing premise

The assumption that embodied action dynamics exactly follow a group structure that can be captured and enforced in the existing latent space of video world models.

What would settle it

Observing that after applying the regularization, the model's predictions for composite actions deviate significantly from the direct prediction of the composed action on real video data.
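
The test described above compares a two-step rollout against a single step taken with the composed action. A model-agnostic sketch of that comparison follows. It is not the paper's GAC metric, and `predict`, `group_action_errors`, and the two toy models on the real line are hypothetical stand-ins used only to make the code runnable.

```python
def group_action_errors(predict, state, a, b, *, identity, inverse, compose, distance):
    """Return (identity drift, inverse-recovery error, composition mismatch)."""
    identity_drift = distance(predict(state, identity), state)
    inverse_error = distance(predict(predict(state, a), inverse(a)), state)
    two_step = predict(predict(state, a), b)   # apply a, then b
    one_step = predict(state, compose(a, b))   # apply the composed action once
    return identity_drift, inverse_error, distance(two_step, one_step)

# Toy group: translations on the real line, with addition as composition.
group = dict(identity=0.0, inverse=lambda a: -a, compose=lambda a, b: a + b,
             distance=lambda s, t: abs(s - t))

def exact_model(state, action):
    """A model that realizes the group action exactly."""
    return state + action

def biased_model(state, action):
    """A model with constant drift and under-scaled motion (assumed, for illustration)."""
    return state + 0.9 * action + 0.05

exact = group_action_errors(exact_model, 0.0, 1.0, 2.0, **group)
biased = group_action_errors(biased_model, 0.0, 1.0, 2.0, **group)
print("exact :", exact)    # all three errors are zero
print("biased:", biased)   # all three errors are positive
```

On a real world model, `state` would be a latent or a decoded frame and `distance` a latent or perceptual distance; those choices are left open here.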

Figures

Figures reproduced from arXiv: 2605.24578 by Fanqi Zhang, Guanbin Li, Weiming Zhang, Wei Zhang, Xiao Tan, Yipeng Qin, Zijie Wang.

Figure 1: Consequences of group-action-faithful dynamics. Identity, inverse consistency, and compatibility are observable conditions implied by the group-action formulation. Representative examples illustrate that GA regularization reduces zero-action drift, improves forward–inverse recovery, and yields more compatible outcomes under equivalent action decompositions.
Figure 2: Preliminary evidence of action inconsistency. (a) Repeated stochastic rollouts from the same initial observation and identical action sequence exhibit divergent motion. (b) Controlled probes reveal violations of identity, inverse consistency, and compatibility, manifested as zero-action drift, incomplete inverse recovery, and endpoint mismatch under equivalent action decompositions.
Figure 3: Group-Action Consistency (GAC) errors on RECON under different probe configurations.
Figure 4: Qualitative results for Group-Action Consistency (GAC) and Group-Action Robustness…
Figure 5: Additional qualitative examples of weak rollout-level robustness in the baseline NWM.
Figure 6: Additional qualitative examples of local GAC failures in the baseline NWM. The figure…
Figure 7: Additional qualitative examples of repeated-rollout robustness after GA training. Each…
Figure 8: Additional qualitative examples of local group-action behavior after GA training. The…

Original abstract

Video world models have achieved strong visual realism, but this does not ensure that their dynamics are truly governed by actions. In this work, we argue that action faithfulness should be understood through the compositional structure of actions, which in many embodied settings follows a group structure (e.g., SE(2) for navigation). Based on this insight, we formalize action-conditioned world modeling as realizing a group action on the state space, providing a principled criterion for evaluating dynamics beyond visual quality. To operationalize this framework, we propose a unified approach that enforces identity, inverse, and composition consistency via latent-space regularization with synthesized supervision, avoiding additional data collection. We further introduce two metrics: Group-Action Consistency (GAC) and Group-Action Robustness (GAR), to evaluate structural correctness and rollout stability. Extensive experimental results show that our method consistently improves both GAC and GAR in state-of-the-art video world models without degrading perceptual quality.
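
The abstract names three consistency terms, but this page does not reproduce their exact form. One plausible shape for such a regularizer is sketched below, purely as an assumption. Here $f_\theta$ denotes the latent transition, $z$ a latent state, $e$ the identity action, $b \cdot a$ the single action equivalent to performing $a$ and then $b$, $\mathcal{L}_{\mathrm{task}}$ the model's original training loss, and the weights $\lambda$ are hypothetical hyperparameters.

```latex
\begin{aligned}
\mathcal{L}_{\mathrm{id}}   &= \lVert f_\theta(z, e) - z \rVert^2, \\
\mathcal{L}_{\mathrm{inv}}  &= \lVert f_\theta(f_\theta(z, a), a^{-1}) - z \rVert^2, \\
\mathcal{L}_{\mathrm{comp}} &= \lVert f_\theta(f_\theta(z, a), b) - f_\theta(z, b \cdot a) \rVert^2, \\
\mathcal{L} &= \mathcal{L}_{\mathrm{task}}
  + \lambda_{\mathrm{id}} \mathcal{L}_{\mathrm{id}}
  + \lambda_{\mathrm{inv}} \mathcal{L}_{\mathrm{inv}}
  + \lambda_{\mathrm{comp}} \mathcal{L}_{\mathrm{comp}} .
\end{aligned}
```

Each term needs only a latent state and sampled actions, which is consistent with the claim that no additional data collection is required.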

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper claims that action faithfulness in video world models should be understood via the compositional group structure of actions (e.g., SE(2) for navigation), formalizes action-conditioned world modeling as realizing a group action on the state space, proposes latent-space regularization with synthesized supervision to enforce identity/inverse/composition consistency, introduces GAC and GAR metrics, and reports that this yields consistent improvements in SOTA models without degrading perceptual quality.

Significance. If the improvements are confirmed by evidence independent of the regularization objective, and the metrics provide an external criterion, the framework supplies a structured, group-theoretic basis for evaluating dynamics beyond visual fidelity; this would be relevant for embodied planning and control. The avoidance of new data collection via synthesized supervision is a practical positive if the synthesis is externally grounded.

major comments (3)
  1. [§3] §3 (synthesized supervision): the generation process for the supervision signals must be specified in detail; if supervision is produced from the base model's own predicted transitions or by applying assumed group operations directly in latent space without external grounding in real action trajectories, the regularization is self-referential and the claim of an independent principled criterion is undermined.
  2. [§4] §4 (GAC/GAR definitions): the metrics directly quantify the same identity, inverse, and composition properties that are enforced by the loss terms; this makes reported gains in GAC/GAR follow by construction from the added regularization rather than from improved alignment with embodied dynamics, weakening the central claim that the approach supplies an independent evaluation criterion.
  3. [§5] §5 (experiments): the manuscript must report concrete baselines, datasets, effect sizes, statistical controls, and ablation results; without these the abstract claim of 'consistent improvements' cannot be verified and may reflect post-hoc metric choices or insufficient controls.
minor comments (2)
  1. Notation for the group homomorphism and latent-space mappings should be introduced with explicit equations at first use to improve clarity.
  2. Ensure every acronym (GAC, GAR, SE(2)) is expanded on first occurrence in the body text.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the detailed and constructive comments. We address each major comment point by point below, indicating where revisions will be incorporated.

  1. Referee: §3 (synthesized supervision): the generation process for the supervision signals must be specified in detail; if supervision is produced from the base model's own predicted transitions or by applying assumed group operations directly in latent space without external grounding in real action trajectories, the regularization is self-referential and the claim of an independent principled criterion is undermined.

    Authors: We agree that the generation process for synthesized supervision must be specified with full precision. The revised manuscript will include an expanded description in §3 detailing that supervision signals are generated by applying the known group operations (identity, inverse, composition) to ground-truth action labels drawn from the original dataset trajectories, rather than from model predictions or direct latent-space manipulations. This ensures external grounding in real action data and preserves the independence of the regularization criterion. revision: yes

  2. Referee: §4 (GAC/GAR definitions): the metrics directly quantify the same identity, inverse, and composition properties that are enforced by the loss terms; this makes reported gains in GAC/GAR follow by construction from the added regularization rather than from improved alignment with embodied dynamics, weakening the central claim that the approach supplies an independent evaluation criterion.

    Authors: The metrics and loss terms target the same group properties by design, as both operationalize the formalization in §2. However, GAC and GAR are evaluated on held-out test trajectories and multi-step rollouts, providing a measure of generalization to unseen data, whereas the regularization is applied only during training. We will revise §4 to explicitly distinguish these roles and clarify that the metrics serve as an external diagnostic of structural correctness beyond training objectives, thereby supporting their use as an independent evaluation criterion. revision: partial

  3. Referee: §5 (experiments): the manuscript must report concrete baselines, datasets, effect sizes, statistical controls, and ablation results; without these the abstract claim of 'consistent improvements' cannot be verified and may reflect post-hoc metric choices or insufficient controls.

    Authors: We acknowledge that the current experimental section requires additional concrete details for full verifiability. The revised manuscript will expand §5 to report specific model baselines, exact dataset names and splits, quantitative effect sizes with standard deviations across multiple runs, statistical significance tests, and further ablation studies isolating the contribution of each consistency term. revision: yes
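
The first response describes building supervision by applying identity, inverse, and composition to logged action labels. One way such probes could be assembled is sketched below. The function name `synthesize_probes`, the tuple layout, and the toy group of in-place turns in whole degrees are assumptions for illustration, not the authors' pipeline.

```python
def synthesize_probes(actions, *, identity, inverse, compose):
    """Derive probes from action labels alone; no new observations are collected.

    Each probe is (kind, action_sequence, equivalent_action): rolling out the
    sequence should reach the same state as applying the equivalent action once.
    """
    probes = [("identity", (identity,), identity)]
    for a in actions:
        probes.append(("inverse", (a, inverse(a)), identity))
    for a, b in zip(actions, actions[1:]):  # consecutive logged actions
        probes.append(("composition", (a, b), compose(a, b)))
    return probes

# Toy group: in-place turns in whole degrees, composed modulo 360.
logged_turns = [30, 90, 350]
probes = synthesize_probes(
    logged_turns,
    identity=0,
    inverse=lambda a: (-a) % 360,
    compose=lambda a, b: (a + b) % 360,
)
for probe in probes:
    print(probe)
```

Because the probes are computed from labels rather than from model outputs, this construction addresses only the grounding of the actions; whether the target states are externally grounded depends on how the rollouts are compared, which the sketch leaves open.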

Circularity Check

1 step flagged

GAC/GAR metrics improve by construction from regularization enforcing the same group consistencies they measure

specific steps
  1. fitted input called prediction [Abstract]
    "we propose a unified approach that enforces identity, inverse, and composition consistency via latent-space regularization with synthesized supervision, avoiding additional data collection. We further introduce two metrics: Group-Action Consistency (GAC) and Group-Action Robustness (GAR), to evaluate structural correctness and rollout stability. Extensive experimental results show that our method consistently improves both GAC and GAR in state-of-the-art video world models without degrading perceptual quality."

    GAC and GAR are defined to measure precisely the identity/inverse/composition consistencies that the regularization loss directly optimizes; therefore reported metric gains follow by construction from applying the synthesized-supervision objective rather than from external verification that the latent dynamics obey an independent group structure.

full rationale

The paper defines action faithfulness as realizing a group action (identity, inverse, composition), operationalizes it by enforcing those exact properties via latent regularization with synthesized supervision, and then reports that the method 'consistently improves both GAC and GAR'. Because GAC and GAR are introduced specifically 'to evaluate structural correctness' of those same consistencies, the claimed gains on the metrics reduce to a direct consequence of the loss terms rather than independent evidence. The perceptual-quality claim is not circular, but the central structural improvement is. No self-citation chains or other patterns appear in the given text.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that action dynamics admit a group structure and that this structure can be imposed via latent regularization without new data; no free parameters or invented entities are mentioned in the abstract.

axioms (1)
  • domain assumption: Actions in embodied settings follow a group structure (e.g., SE(2) for navigation)
    Invoked in the abstract as the basis for formalizing action-conditioned world modeling as realizing a group action.

pith-pipeline@v0.9.1-grok · 5697 in / 1312 out tokens · 31629 ms · 2026-06-30T13:15:03.868084+00:00 · methodology

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. ATM: Action-Consistency Transfer Matrix for Diagnosing and Improving Latent World Models

    cs.CV 2026-06 unverdicted novelty 6.0

    ATM is a post-hoc probe-based transfer matrix that diagnoses action consistency in latent world models and serves as a training signal via AITS, enabling fast reliable ranking with claimed 100x speedup over CEM planne...

Reference graph

Works this paper leans on

88 extracted references · 45 canonical work pages · cited by 1 Pith paper · 18 internal anchors

  1. [1]

    Llm-planner: Few-shot grounded planning for embodied agents with large language models

    Chan Hee Song, Jiaman Wu, Clayton Washington, Brian M Sadler, Wei-Lun Chao, and Yu Su. Llm-planner: Few-shot grounded planning for embodied agents with large language models. In Proceedings of the IEEE/CVF international conference on computer vision, pages 2998–3009, 2023

  2. [2]

    Grhp: Graph- fused hierarchical planning for embodied long-horizon robotic task.Engineering Applications of Artificial Intelligence, 165:113413, 2026

    Xiaodong Li, Guohui Tian, Yongcheng Cui, Xuyang Shao, and Zhiwei Wang. Grhp: Graph- fused hierarchical planning for embodied long-horizon robotic task.Engineering Applications of Artificial Intelligence, 165:113413, 2026

  3. [3]

    Embodied Task Planning via Graph-Informed Action Generation with Large Language Models

    Xiang Li, Ning Yan, and Masood Mortazavi. Embodied task planning via graph-informed action generation with large language model.arXiv preprint arXiv:2601.21841, 2026

  4. [4]

    Embodied agent interface: Benchmarking llms for embodied decision making.Advances in Neural Information Processing Systems, 37:100428–100534, 2024

    Manling Li, Shiyu Zhao, Qineng Wang, Kangrui Wang, Yu Zhou, Sanjana Srivastava, Cem Gokmen, Tony Lee, Li E Li, Ruohan Zhang, et al. Embodied agent interface: Benchmarking llms for embodied decision making.Advances in Neural Information Processing Systems, 37:100428–100534, 2024

  5. [5]

    Vlabench: A large-scale benchmark for language- conditioned robotics manipulation with long-horizon reasoning tasks

    Shiduo Zhang, Zhe Xu, Peiju Liu, Xiaopeng Yu, Yuan Li, Qinghui Gao, Zhaoye Fei, Zhangyue Yin, Zuxuan Wu, Yu-Gang Jiang, et al. Vlabench: A large-scale benchmark for language- conditioned robotics manipulation with long-horizon reasoning tasks. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 11142–11152, 2025

  6. [6]

    Large model empowered embodied ai: A survey on decision-making and embodied learning

    Wenlong Liang, Rui Zhou, Yang Ma, Bing Zhang, Songlin Li, Yijia Liao, and Ping Kuang. Large model empowered embodied ai: A survey on decision-making and embodied learning. arXiv preprint arXiv:2508.10399, 2025

  7. [7]

    Meia: Multimodal embodied perception and interaction in unknown environments.arXiv preprint arXiv:2402.00290, 2024

    Yang Liu, Xinshuai Song, Kaixuan Jiang, Weixing Chen, Jingzhou Luo, Guanbin Li, and Liang Lin. Meia: Multimodal embodied perception and interaction in unknown environments.arXiv preprint arXiv:2402.00290, 2024

  8. [8]

    Aligning cyber space with physical world: A comprehensive survey on embodied ai

    Yang Liu, Weixing Chen, Yongjie Bai, Xiaodan Liang, Guanbin Li, Wen Gao, and Liang Lin. Aligning cyber space with physical world: A comprehensive survey on embodied ai. IEEE/ASME Transactions on Mechatronics, 2025

  9. [9]

    Beyond the destination: A novel benchmark for exploration-aware embodied question answering

    Kaixuan Jiang, Yang Liu, Weixing Chen, Jingzhou Luo, Ziliang Chen, Ling Pan, Guanbin Li, and Liang Lin. Beyond the destination: A novel benchmark for exploration-aware embodied question answering. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 9091–9101, 2025

  10. [10]

    Video diffusion models.Advances in neural information processing systems, 35:8633–8646, 2022

    Jonathan Ho, Tim Salimans, Alexey Gritsenko, William Chan, Mohammad Norouzi, and David J Fleet. Video diffusion models.Advances in neural information processing systems, 35:8633–8646, 2022

  11. [11]

    CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer

    Zhuoyi Yang, Jiayan Teng, Wendi Zheng, Ming Ding, Shiyu Huang, Jiazheng Xu, Yuanming Yang, Wenyi Hong, Xiaohan Zhang, Guanyu Feng, et al. Cogvideox: Text-to-video diffusion models with an expert transformer.arXiv preprint arXiv:2408.06072, 2024

  12. [12]

    Wan: Open and Advanced Large-Scale Video Generative Models

    Team Wan, Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, Chen-Wei Xie, Di Chen, Feiwu Yu, Haiming Zhao, Jianxiao Yang, et al. Wan: Open and advanced large-scale video generative models.arXiv preprint arXiv:2503.20314, 2025

  13. [13]

    Navigation world models

    Amir Bar, Gaoyue Zhou, Danny Tran, Trevor Darrell, and Yann LeCun. Navigation world models. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 15791–15801, 2025

  14. [14]

    Rapid exploration for open-world navigation with latent goal models.arXiv preprint arXiv:2104.05859, 2021

    Dhruv Shah, Benjamin Eysenbach, Gregory Kahn, Nicholas Rhinehart, and Sergey Levine. Rapid exploration for open-world navigation with latent goal models.arXiv preprint arXiv:2104.05859, 2021

  15. [15]

    Socially compliant navigation dataset (scand): A large-scale dataset of demonstrations for social navigation.IEEE Robotics and Automation Letters, 7(4):11807–11814, 2022

    Haresh Karnan, Anirudh Nair, Xuesu Xiao, Garrett Warnell, Sören Pirk, Alexander Toshev, Justin Hart, Joydeep Biswas, and Peter Stone. Socially compliant navigation dataset (scand): A large-scale dataset of demonstrations for social navigation.IEEE Robotics and Automation Letters, 7(4):11807–11814, 2022. 10

  16. [16]

    Sacson: Scalable autonomous control for social navigation.IEEE Robotics and Automation Letters, 9(1):49–56, 2023

    Noriaki Hirose, Dhruv Shah, Ajay Sridhar, and Sergey Levine. Sacson: Scalable autonomous control for social navigation.IEEE Robotics and Automation Letters, 9(1):49–56, 2023

  17. [17]

    Diffusion for world modeling: Visual details matter in atari.Advances in Neural Information Processing Systems, 37:58757–58791, 2024

    Eloi Alonso, Adam Jelley, Vincent Micheli, Anssi Kanervisto, Amos Storkey, Tim Pearce, and François Fleuret. Diffusion for world modeling: Visual details matter in atari.Advances in Neural Information Processing Systems, 37:58757–58791, 2024

  18. [18]

    Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets

    Andreas Blattmann, Tim Dockhorn, Sumith Kulal, Daniel Mendelevitch, Maciej Kilian, Do- minik Lorenz, Yam Levi, Zion English, Vikram V oleti, Adam Letts, et al. Stable video diffusion: Scaling latent video diffusion models to large datasets.arXiv preprint arXiv:2311.15127, 2023

  19. [19]

    Decoupled Weight Decay Regularization

    Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization.arXiv preprint arXiv:1711.05101, 2017

  20. [20]

    Droid-slam: Deep visual slam for monocular, stereo, and rgb-d cameras.Advances in neural information processing systems, 34:16558–16569, 2021

    Zachary Teed and Jia Deng. Droid-slam: Deep visual slam for monocular, stereo, and rgb-d cameras.Advances in neural information processing systems, 34:16558–16569, 2021

  21. [21]

    The unrea- sonable effectiveness of deep features as a perceptual metric

    Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang. The unrea- sonable effectiveness of deep features as a perceptual metric. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 586–595, 2018

  22. [22]

    Dreamsim: Learning new dimensions of human visual similarity using synthetic data.Advances in Neural Information Processing Systems, 36:50742–50768, 2023

    Stephanie Fu, Netanel Tamir, Shobhita Sundaram, Lucy Chai, Richard Zhang, Tali Dekel, and Phillip Isola. Dreamsim: Learning new dimensions of human visual similarity using synthetic data.Advances in Neural Information Processing Systems, 36:50742–50768, 2023

  23. [23]

    Nomad: Goal masked diffusion policies for navigation and exploration

    Ajay Sridhar, Dhruv Shah, Catherine Glossop, and Sergey Levine. Nomad: Goal masked diffusion policies for navigation and exploration. In2024 IEEE International Conference on Robotics and Automation (ICRA), pages 63–70. IEEE, 2024

  24. [24]

    Goal-conditioned reinforcement learning for data-driven maritime navigation

    Vaishnav Vaidheeswaran, Dilith Jayakody, Samruddhi Mulay, Anand Lo, Md Mahbub Alam, and Gabriel Spadon. Goal-conditioned reinforcement learning for data-driven maritime navigation. arXiv preprint arXiv:2509.01838, 2025

  25. [25]

    An efficient and multi-modal navigation system with one-step world model.arXiv preprint arXiv:2601.12277, 2026

    Wangtian Shen, Ziyang Meng, Jinming Ma, Mingliang Zhou, and Diyun Xiang. An efficient and multi-modal navigation system with one-step world model.arXiv preprint arXiv:2601.12277, 2026

  26. [26]

    Vision-and-language navi- gation: A survey of tasks, methods, and future directions

    Jing Gu, Eliana Stefani, Qi Wu, Jesse Thomason, and Xin Wang. Vision-and-language navi- gation: A survey of tasks, methods, and future directions. InProceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 7606–7623, 2022

  27. [27]

    Navgpt: Explicit reasoning in vision-and-language navigation with large language models

    Gengze Zhou, Yicong Hong, and Qi Wu. Navgpt: Explicit reasoning in vision-and-language navigation with large language models. InProceedings of the AAAI Conference on Artificial Intelligence, volume 38, pages 7641–7649, 2024

  28. [28]

    Profocus: Proactive perception and focused reasoning in vision-and-language navigation.arXiv preprint arXiv:2603.05530, 2026

    Wei Xue, Mingcheng Li, Xuecheng Wu, Jingqun Tang, Dingkang Yang, and Lihua Zhang. Profocus: Proactive perception and focused reasoning in vision-and-language navigation.arXiv preprint arXiv:2603.05530, 2026

  29. [29]

    Effonav: An effective foundation- model-based visual navigation approach in challenging environment.IEEE Robotics and Automation Letters, 2025

    Wangtian Shen, Pengfei Gu, Haijian Qin, and Ziyang Meng. Effonav: An effective foundation- model-based visual navigation approach in challenging environment.IEEE Robotics and Automation Letters, 2025

  30. [30]

    Foundation-model-based action selection for behavior trees in navigation

    Michele Moriconi, Stefan Laible, and Carmine Recchiuto. Foundation-model-based action selection for behavior trees in navigation. In2025 European Conference on Mobile Robots (ECMR), pages 1–7. IEEE, 2025

  31. [31]

    Foundation model driven robotics: A compre- hensive review.arXiv preprint arXiv:2507.10087, 2025

    Muhammad Tayyab Khan and Ammar Waheed. Foundation model driven robotics: A compre- hensive review.arXiv preprint arXiv:2507.10087, 2025

  32. [32]

    Gnm: A general navigation model to drive any robot

    Dhruv Shah, Ajay Sridhar, Arjun Bhorkar, Noriaki Hirose, and Sergey Levine. Gnm: A general navigation model to drive any robot. In2023 IEEE International Conference on Robotics and Automation (ICRA), pages 7226–7233. IEEE, 2023. 11

  33. [33]

    Vint: A foundation model for visual navigation.arXiv preprint arXiv:2306.14846, 2023

    Dhruv Shah, Ajay Sridhar, Nitish Dashora, Kyle Stachowicz, Kevin Black, Noriaki Hi- rose, and Sergey Levine. Vint: A foundation model for visual navigation.arXiv preprint arXiv:2306.14846, 2023

  34. [34]

    Towards long-horizon vision-language navigation: Platform, benchmark and method

    Xinshuai Song, Weixing Chen, Yang Liu, Vincent Chan, Guanbin Li, and Liang Lin. Towards long-horizon vision-language navigation: Platform, benchmark and method. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025

  35. [35]

    Embodied navigation foundation model.arXiv preprint arXiv:2509.12129, 2025

    Jiazhao Zhang, Anqi Li, Yunpeng Qi, Minghan Li, Jiahang Liu, Shaoan Wang, Haoran Liu, Gengze Zhou, Yuze Wu, Xingxing Li, et al. Embodied navigation foundation model.arXiv preprint arXiv:2509.12129, 2025

  36. [36]

    Vln-r1: Vision-language navigation via reinforcement fine-tuning.arXiv preprint arXiv:2506.17221, 2025

    Zhangyang Qi, Zhixiong Zhang, Yizhou Yu, Jiaqi Wang, and Hengshuang Zhao. Vln-r1: Vision-language navigation via reinforcement fine-tuning.arXiv preprint arXiv:2506.17221, 2025

  37. [37]

    ImagineNav++: Prompting Vision-Language Models as Embodied Navigator through Scene Imagination

    Teng Wang, Xinxin Zhao, Wenzhe Cai, and Changyin Sun. Imaginenav++: Prompting vision-language models as embodied navigator through scene imagination.arXiv preprint arXiv:2512.17435, 2025

  38. [38]

    Vld: Visual language goal distance for reinforce- ment learning navigation.arXiv preprint arXiv:2512.07976, 2025

    Lazar Milikic, Manthan Patel, and Jonas Frey. Vld: Visual language goal distance for reinforce- ment learning navigation.arXiv preprint arXiv:2512.07976, 2025

  39. [39]

    Can Vision Foundation Models Navigate? Zero-Shot Real-World Evaluation and Lessons Learned

    Maeva Guerrier, Karthik Soma, Jana Pavlasek, and Giovanni Beltrame. Can vision founda- tion models navigate? zero-shot real-world evaluation and lessons learned.arXiv preprint arXiv:2603.25937, 2026

  40. [40]

    Navforesee: A unified vision-language world model for hierarchi- cal planning and dual-horizon navigation prediction

    Fei Liu, Shichao Xie, Minghua Luo, Zedong Chu, Junjun Hu, Xiaolong Wu, and Mu Xu. Navforesee: A unified vision-language world model for hierarchical planning and dual-horizon navigation prediction.arXiv preprint arXiv:2512.01550, 2025

  41. [41]

    Dream to Control: Learning Behaviors by Latent Imagination

    Danijar Hafner, Timothy Lillicrap, Jimmy Ba, and Mohammad Norouzi. Dream to control: Learning behaviors by latent imagination.arXiv preprint arXiv:1912.01603, 2019

  42. [42]

    Mastering Diverse Domains through World Models

    Danijar Hafner, Jurgis Pasukonis, Jimmy Ba, and Timothy Lillicrap. Mastering diverse domains through world models.arXiv preprint arXiv:2301.04104, 2023

  43. [43]

    Dreaming: Model-based reinforcement learning by latent imagination without reconstruction

    Masashi Okada and Tadahiro Taniguchi. Dreaming: Model-based reinforcement learning by latent imagination without reconstruction. In2021 ieee international conference on robotics and automation (icra), pages 4209–4215. IEEE, 2021

  44. [44]

    Model-based reinforcement learning via imagination with derived memory.Advances in Neural Information Processing Systems, 34:9493–9505, 2021

    Yao Mu, Yuzheng Zhuang, Bin Wang, Guangxiang Zhu, Wulong Liu, Jianyu Chen, Ping Luo, Shengbo Li, Chongjie Zhang, and Jianye Hao. Model-based reinforcement learning via imagination with derived memory.Advances in Neural Information Processing Systems, 34:9493–9505, 2021

  45. [45]

    Dreamerad: Efficient reinforcement learning via latent world model for autonomous driving.arXiv preprint arXiv:2603.24587, 2026

    Pengxuan Yang, Yupeng Zheng, Deheng Qian, Zebin Xing, Qichao Zhang, Linbo Wang, Yichen Zhang, Shaoyu Guo, Zhongpu Xia, Qiang Chen, et al. Dreamerad: Efficient reinforcement learning via latent world model for autonomous driving.arXiv preprint arXiv:2603.24587, 2026

  46. [46]

    4d latent world model for robot planning.OpenReview preprint, 2026

    Zhiyi Li, Peilin Wu, Xiaoshen Han, Ruojin Cai, and Yilun Du. 4d latent world model for robot planning.OpenReview preprint, 2026

  47. [47]

    Denoising diffusion probabilistic models.Advances in neural information processing systems, 33:6840–6851, 2020

    Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models.Advances in neural information processing systems, 33:6840–6851, 2020

  48. [48]

    High- resolution image synthesis with latent diffusion models

    Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High- resolution image synthesis with latent diffusion models. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10684–10695, 2022

  49. [49]

    Video diffusion models.Advances in neural information processing systems, 35:8633–8646, 2022

    Jonathan Ho, Tim Salimans, Alexey Gritsenko, William Chan, Mohammad Norouzi, and David J Fleet. Video diffusion models.Advances in neural information processing systems, 35:8633–8646, 2022. 12

  50. [50]

    Imagen Video: High Definition Video Generation with Diffusion Models

    Jonathan Ho, William Chan, Chitwan Saharia, Jay Whang, Ruiqi Gao, Alexey Gritsenko, Diederik P Kingma, Ben Poole, Mohammad Norouzi, David J Fleet, et al. Imagen video: High definition video generation with diffusion models.arXiv preprint arXiv:2210.02303, 2022

  51. [51]

    Align your latents: High-resolution video synthesis with latent diffusion models

    Andreas Blattmann, Robin Rombach, Huan Ling, Tim Dockhorn, Seung Wook Kim, Sanja Fidler, and Karsten Kreis. Align your latents: High-resolution video synthesis with latent diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 22563–22575, 2023

  52. [52]

    AnimateDiff: Animate Your Personalized Text-to-Image Diffusion Models without Specific Tuning

    Yuwei Guo, Ceyuan Yang, Anyi Rao, Zhengyang Liang, Yaohui Wang, Yu Qiao, Maneesh Agrawala, Dahua Lin, and Bo Dai. Animatediff: Animate your personalized text-to-image diffusion models without specific tuning. arXiv preprint arXiv:2307.04725, 2023

  53. [53]

    Open-Sora: Democratizing Efficient Video Production for All

    Zangwei Zheng, Xiangyu Peng, Tianji Yang, Chenhui Shen, Shenggui Li, Hongxin Liu, Yukun Zhou, Tianyi Li, and Yang You. Open-sora: Democratizing efficient video production for all. arXiv preprint arXiv:2412.20404, 2024

  54. [54]

    Vid2world: Crafting video diffusion models to interactive world models

    Siqiao Huang, Jialong Wu, Qixing Zhou, Shangchen Miao, and Mingsheng Long. Vid2world: Crafting video diffusion models to interactive world models. arXiv preprint arXiv:2505.14357, 2025

  55. [55]

    World Action Models are Zero-shot Policies

    Seonghyeon Ye, Yunhao Ge, Kaiyuan Zheng, Shenyuan Gao, Sihyun Yu, George Kurian, Suneel Indupuru, You Liang Tan, Chuning Zhu, Jiannan Xiang, et al. World action models are zero-shot policies. arXiv preprint arXiv:2602.15922, 2026

  56. [56]

    Aether: Geometric-aware unified world modeling

    Haoyi Zhu, Yifan Wang, Jianjun Zhou, Wenzheng Chang, Yang Zhou, Zizun Li, Junyi Chen, Chunhua Shen, Jiangmiao Pang, and Tong He. Aether: Geometric-aware unified world modeling. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 8535–8546, 2025

  57. [57]

    Language-conditioned world modeling for visual navigation. arXiv preprint arXiv:2603.26741, 2026

    Yifei Dong, Fengyi Wu, Yilong Dai, Lingdong Kong, Guangyu Chen, Xu Zhu, Qiyu Hu, Tianyu Wang, Johnalbert Garnica, Feng Liu, et al. Language-conditioned world modeling for visual navigation. arXiv preprint arXiv:2603.26741, 2026

  58. [58]

    Learning Interactive Real-World Simulators

    Mengjiao Yang, Yilun Du, Kamyar Ghasemipour, Jonathan Tompson, Dale Schuurmans, and Pieter Abbeel. Learning interactive real-world simulators. arXiv preprint arXiv:2310.06114, 1(2):6, 2023

  59. [59]

    Unified world models: Memory-augmented planning and foresight for visual navigation. arXiv preprint arXiv:2510.08713, 2025

    Yifei Dong, Fengyi Wu, Guangyu Chen, Zhi-Qi Cheng, Qiyu Hu, Yuxuan Zhou, Jingdong Sun, Jun-Yan He, Qi Dai, and Alexander G Hauptmann. Unified world models: Memory-augmented planning and foresight for visual navigation. arXiv preprint arXiv:2510.08713, 2025

  60. [60]

    Mowm: Mixture-of-world-models for embodied planning via latent-to-pixel feature modulation. arXiv preprint arXiv:2509.21797, 2025

    Yangcheng Yu, Xin Jin, Yu Shang, Xin Zhang, Haisheng Su, Wei Wu, and Yong Li. Mowm: Mixture-of-world-models for embodied planning via latent-to-pixel feature modulation. arXiv preprint arXiv:2509.21797, 2025

  61. [61]

    Gigaworld-policy: An efficient action-centered world–action model. arXiv preprint arXiv:2603.17240, 2026

    Angen Ye, Boyuan Wang, Chaojun Ni, Guan Huang, Guosheng Zhao, Hao Li, Hengtao Li, Jie Li, Jindi Lv, Jingyu Liu, et al. Gigaworld-policy: An efficient action-centered world–action model. arXiv preprint arXiv:2603.17240, 2026

  62. [62]

    Towards high-consistency embodied world model with multi-view trajectory videos. arXiv preprint arXiv:2511.12882, 2025

    Taiyi Su, Jian Zhu, Yaxuan Li, Chong Ma, Jianjun Zhang, Zitai Huang, Hanli Wang, and Yi Xu. Towards high-consistency embodied world model with multi-view trajectory videos. arXiv preprint arXiv:2511.12882, 2025

  63. [63]

    Worldsimbench: Towards video generation models as world simulators. arXiv preprint arXiv:2410.18072, 2024

    Yiran Qin, Zhelun Shi, Jiwen Yu, Xijun Wang, Enshen Zhou, Lijun Li, Zhenfei Yin, Xihui Liu, Lu Sheng, Jing Shao, et al. Worldsimbench: Towards video generation models as world simulators. arXiv preprint arXiv:2410.18072, 2024

  64. [64]

    Wow, wo, val! a comprehensive embodied world model evaluation turing test. arXiv preprint arXiv:2601.04137, 2026

    Chun-Kai Fan, Xiaowei Chi, Xiaozhu Ju, Hao Li, Yong Bao, Yu-Kai Wang, Lizhang Chen, Zhiyuan Jiang, Kuangzhi Ge, Ying Li, et al. Wow, wo, val! a comprehensive embodied world model evaluation turing test. arXiv preprint arXiv:2601.04137, 2026

  65. [65]

    Worldarena: A unified benchmark for evaluating perception and functional utility of embodied world models. arXiv preprint arXiv:2602.08971, 2026

    Yu Shang, Zhuohang Li, Yiding Ma, Weikang Su, Xin Jin, Ziyou Wang, Lei Jin, Xin Zhang, Yinzhou Tang, Haisheng Su, et al. Worldarena: A unified benchmark for evaluating perception and functional utility of embodied world models. arXiv preprint arXiv:2602.08971, 2026

  66. [66]

    WorldLens: Full-Spectrum Evaluations of Driving World Models in Real World

    Ao Liang, Lingdong Kong, Tianyi Yan, Hongsi Liu, Wesley Yang, Ziqi Huang, Wei Yin, Jialong Zuo, Yixuan Hu, Dekai Zhu, et al. Worldlens: Full-spectrum evaluations of driving world models in real world. arXiv preprint arXiv:2512.10958, 2025

  67. [67]

    Mind: Benchmarking memory consistency and action control in world models. arXiv preprint arXiv:2602.08025, 2026

    Yixuan Ye, Xuanyu Lu, Yuxin Jiang, Yuchao Gu, Rui Zhao, Qiwei Liang, Jiachun Pan, Fengda Zhang, Weijia Wu, and Alex Jinpeng Wang. Mind: Benchmarking memory consistency and action control in world models. arXiv preprint arXiv:2602.08025, 2026

  68. [68]

    LoopNav: Benchmarking Spatial Consistency in World Models

    Kewei Lian, Shaofei Cai, Yilun Du, and Yitao Liang. Toward memory-aided world models: Benchmarking via spatial consistency. arXiv preprint arXiv:2505.22976, 2025

  69. [69]

    Toward Consistent World Models with Multi-Token Prediction and Latent Semantic Enhancement

    Qimin Zhong, Hao Liao, Haiming Qin, Mingyang Zhou, Rui Mao, Wei Chen, and Naipeng Chao. Toward consistent world models with multi-token prediction and latent semantic enhancement. arXiv preprint arXiv:2604.06155, 2026

  70. [70]

    Wildworld: A large-scale dataset for dynamic world modeling with actions and explicit state toward generative arpg. arXiv preprint arXiv:2603.23497, 2026

    Zhen Li, Zian Meng, Shuwei Shi, Wenshuo Peng, Yuwei Wu, Bo Zheng, Chuanhao Li, and Kaipeng Zhang. Wildworld: A large-scale dataset for dynamic world modeling with actions and explicit state toward generative arpg. arXiv preprint arXiv:2603.23497, 2026

  71. [71]

    Mwm: Mobile world models for action-conditioned consistent prediction. arXiv preprint arXiv:2603.07799, 2026

    Han Yan, Zishang Xiang, Zeyu Zhang, and Hao Tang. Mwm: Mobile world models for action-conditioned consistent prediction. arXiv preprint arXiv:2603.07799, 2026

  72. [72]

    Generative modeling of molecular dynamics trajectories. Advances in Neural Information Processing Systems, 37:40534–40564, 2024

    Bowen Jing, Hannes Stärk, Tommi Jaakkola, and Bonnie Berger. Generative modeling of molecular dynamics trajectories. Advances in Neural Information Processing Systems, 37:40534–40564, 2024

  73. [73]

    Scalable emulation of protein equilibrium ensembles with generative deep learning. Science, 389(6761):eadv9817, 2025

    Sarah Lewis, Tim Hempel, José Jiménez-Luna, Michael Gastegger, Yu Xie, Andrew YK Foong, Victor García Satorras, Osama Abdin, Bastiaan S Veeling, Iryna Zaporozhets, et al. Scalable emulation of protein equilibrium ensembles with generative deep learning. Science, 389(6761):eadv9817, 2025

  74. [74]

    Beyond ensembles: Simulating all-atom protein dynamics in a learned latent space. arXiv preprint arXiv:2509.02196, 2025

    Aditya Sengar, Jiying Zhang, Pierre Vandergheynst, and Patrick Barth. Beyond ensembles: Simulating all-atom protein dynamics in a learned latent space. arXiv preprint arXiv:2509.02196, 2025

  75. [75]

    Conditional diffusion with locality-aware modal alignment for generating diverse protein conformational ensembles. Nature Machine Intelligence, pages 1–20, 2026

    Baoli Wang, Chenglin Wang, Jingyang Chen, Danlin Liu, Changzhi Sun, Jie Zhang, Kai Zhang, and Honglin Li. Conditional diffusion with locality-aware modal alignment for generating diverse protein conformational ensembles. Nature Machine Intelligence, pages 1–20, 2026

  76. [76]

    Pathdiffusion: modeling protein folding pathway using evolution-guided diffusion. bioRxiv, pages 2026–01, 2026

    Kailong Zhao, Chenxiao Xiang, Bin Cheng, Yunyun Shen, Wenkai Wang, Shuyun Chen, Baoquan Su, Guijun Zhang, Zhenling Peng, and Jianyi Yang. Pathdiffusion: modeling protein folding pathway using evolution-guided diffusion. bioRxiv, pages 2026–01, 2026

  77. [77]

    Tensor field networks: Rotation- and translation-equivariant neural networks for 3D point clouds

    Nathaniel Thomas, Tess Smidt, Steven Kearnes, Lusann Yang, Li Li, Kai Kohlhoff, and Patrick Riley. Tensor field networks: Rotation- and translation-equivariant neural networks for 3d point clouds. arXiv preprint arXiv:1802.08219, 2018

  78. [78]

    E(n) equivariant graph neural networks

    Víctor Garcia Satorras, Emiel Hoogeboom, and Max Welling. E(n) equivariant graph neural networks. In International conference on machine learning, pages 9323–9332. PMLR, 2021

  79. [79]

    Equivariant diffusion for molecule generation in 3d

    Emiel Hoogeboom, Víctor Garcia Satorras, Clément Vignac, and Max Welling. Equivariant diffusion for molecule generation in 3d. In International conference on machine learning, pages 8867–8887. PMLR, 2022

  80. [80]

    Geodiff: A geometric diffusion model for molecular conformation generation. arXiv preprint arXiv:2203.02923, 2022

    Minkai Xu, Lantao Yu, Yang Song, Chence Shi, Stefano Ermon, and Jian Tang. Geodiff: A geometric diffusion model for molecular conformation generation. arXiv preprint arXiv:2203.02923, 2022

Showing first 80 references.