World Models as Group Actions

Fanqi Zhang; Guanbin Li; Weiming Zhang; Wei Zhang; Xiao Tan; Yipeng Qin; Zijie Wang

arxiv: 2605.24578 · v1 · pith:VSXCWOS7new · submitted 2026-05-23 · 💻 cs.CV

World Models as Group Actions

Zijie Wang , Wei Zhang , Weiming Zhang , Fanqi Zhang , Xiao Tan , Yipeng Qin , Guanbin Li This is my paper

Pith reviewed 2026-06-30 13:15 UTC · model grok-4.3

classification 💻 cs.CV

keywords video world modelsgroup actionsaction consistencylatent regularizationSE(2) navigationworld model dynamicsGAC metricGAR metric

0 comments

The pith

Video world models realize group actions on latent states when identity, inverse, and composition rules are enforced through regularization.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper claims that action faithfulness in video world models is best understood as the model realizing a group action on the state space, such as the special Euclidean group for navigation tasks. This provides a clear criterion beyond visual realism: the model must respect how actions compose, invert, and include an identity element. The authors operationalize this by adding latent-space regularization terms that enforce these properties using only synthesized supervision, without needing extra real data. Experiments on state-of-the-art models show gains in the proposed Group-Action Consistency and Group-Action Robustness metrics while keeping perceptual quality intact. A sympathetic reader would care because this turns world models from video generators into reliable simulators for planning and control in embodied settings.

Core claim

The central claim is that action-conditioned world modeling amounts to realizing a group action on the state space, and that enforcing the group properties of identity, inverse, and composition consistency through latent regularization with synthesized data improves structural correctness of the dynamics.

What carries the argument

Realizing group actions on the latent state space via consistency regularization on identity, inverse, and composition properties using synthesized supervision.

If this is right

Action sequences will compose correctly in the model's predictions.
Inverse actions will reliably return the state to its prior condition.
Identity actions will leave the predicted state unchanged.
Rollouts will exhibit greater stability over long horizons under the group constraints.
These improvements occur without any loss in visual fidelity of the generated videos.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Applying the same regularization to other domains like robotic manipulation could enforce their respective group structures such as SE(3).
Future work might derive the group structure automatically from data rather than assuming it a priori.
The metrics GAC and GAR could serve as training objectives in addition to evaluation.

Load-bearing premise

The assumption that embodied action dynamics exactly follow a group structure that can be captured and enforced in the existing latent space of video world models.

What would settle it

Observing that after applying the regularization, the model's predictions for composite actions deviate significantly from the direct prediction of the composed action on real video data.

Figures

Figures reproduced from arXiv: 2605.24578 by Fanqi Zhang, Guanbin Li, Weiming Zhang, Wei Zhang, Xiao Tan, Yipeng Qin, Zijie Wang.

**Figure 1.** Figure 1: Consequences of group-action-faithful dynamics. Identity, inverse consistency, and compatibility are observable conditions implied by the group-action formulation. Representative examples illustrate that GA regularization reduces zero-action drift, improves forward–inverse recovery, and yields more compatible outcomes under equivalent action decompositions. the underlying group action structure. Leveraging… view at source ↗

**Figure 2.** Figure 2: Preliminary evidence of action inconsistency. (a) Repeated stochastic rollouts from the same initial observation and identical action sequence exhibit divergent motion. (b) Controlled probes reveal violations of identity, inverse consistency, and compatibility, manifested as zero-action drift, incomplete inverse recovery, and endpoint mismatch under equivalent action decompositions. 2.2 Do Native Video Wor… view at source ↗

**Figure 3.** Figure 3: Group-Action Consistency (GAC) errors on RECON under different probe configurations. [PITH_FULL_IMAGE:figures/full_fig_p008_3.png] view at source ↗

**Figure 4.** Figure 4: Qualitative results for Group-Action Consistency (GAC) and Group-Action Robustness [PITH_FULL_IMAGE:figures/full_fig_p009_4.png] view at source ↗

**Figure 5.** Figure 5: Additional qualitative examples of weak rollout-level robustness in the baseline NWM. [PITH_FULL_IMAGE:figures/full_fig_p025_5.png] view at source ↗

**Figure 6.** Figure 6: Additional qualitative examples of local GAC failures in the baseline NWM. The figure [PITH_FULL_IMAGE:figures/full_fig_p026_6.png] view at source ↗

**Figure 7.** Figure 7: Additional qualitative examples of repeated-rollout robustness after GA training. Each [PITH_FULL_IMAGE:figures/full_fig_p027_7.png] view at source ↗

**Figure 8.** Figure 8: Additional qualitative examples of local group-action behavior after GA training. The [PITH_FULL_IMAGE:figures/full_fig_p028_8.png] view at source ↗

read the original abstract

Video world models have achieved strong visual realism, but this does not ensure that their dynamics are truly governed by actions. In this work, we argue that action faithfulness should be understood through the compositional structure of actions, which in many embodied settings follows a group structure (e.g., SE(2) for navigation). Based on this insight, we formalize action-conditioned world modeling as realizing a group action on the state space, providing a principled criterion for evaluating dynamics beyond visual quality. To operationalize this framework, we propose a unified approach that enforces identity, inverse, and composition consistency via latent-space regularization with synthesized supervision, avoiding additional data collection. We further introduce two metrics: Group-Action Consistency (GAC) and Group-Action Robustness (GAR), to evaluate structural correctness and rollout stability. Extensive experimental results show that our method consistently improves both GAC and GAR in state-of-the-art video world models without degrading perceptual quality.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The group-action framing adds a clean criterion for consistency in world models but the abstract leaves the experiments and grounding too thin to judge if the gains are real or constructed.

read the letter

The paper's central move is to treat action-conditioned video prediction as a group action on latent states, then enforce identity, inverse, and composition consistency through latent regularization that uses synthesized supervision instead of new data. It also defines GAC and GAR metrics to measure how well those properties hold.

What is new is the explicit group-theoretic criterion and the unified regularization scheme that targets all three axioms at once. The metrics are a practical addition for evaluating structural correctness beyond pixel-level quality.

The framing itself is reasonable for settings where actions have clear algebraic structure, such as rigid-body navigation. It gives a principled way to talk about faithfulness without requiring changes to the base architecture.

The main weakness is that the abstract reports "consistent improvements" with no baselines, datasets, effect sizes, or controls shown. That makes it impossible to tell whether the method actually improves dynamics or simply satisfies the loss terms by construction. The stress-test concern about circularity lands: if the synthesized supervision comes from the model's own rollouts or assumed group operations on latents, then GAC and GAR gains follow directly from the regularization rather than from better alignment with real action trajectories.

This work is aimed at researchers building video world models for robotics and planning. Readers who want a formal handle on action consistency will find the setup useful even if they end up modifying the details.

It deserves a serious referee to check the full experiments and see whether the group structure is independently verified or just imposed.

Referee Report

3 major / 2 minor

Summary. The paper claims that action faithfulness in video world models should be understood via the compositional group structure of actions (e.g., SE(2) for navigation), formalizes action-conditioned world modeling as realizing a group action on the state space, proposes latent-space regularization with synthesized supervision to enforce identity/inverse/composition consistency, introduces GAC and GAR metrics, and reports that this yields consistent improvements in SOTA models without degrading perceptual quality.

Significance. If the improvements are shown to be independent of the regularization and the metrics provide an external criterion, the framework supplies a structured, group-theoretic basis for evaluating dynamics beyond visual fidelity; this would be relevant for embodied planning and control. The avoidance of new data collection via synthesized supervision is a practical positive if the synthesis is externally grounded.

major comments (3)

[§3] §3 (synthesized supervision): the generation process for the supervision signals must be specified in detail; if supervision is produced from the base model's own predicted transitions or by applying assumed group operations directly in latent space without external grounding in real action trajectories, the regularization is self-referential and the claim of an independent principled criterion is undermined.
[§4] §4 (GAC/GAR definitions): the metrics directly quantify the same identity, inverse, and composition properties that are enforced by the loss terms; this makes reported gains in GAC/GAR follow by construction from the added regularization rather than from improved alignment with embodied dynamics, weakening the central claim that the approach supplies an independent evaluation criterion.
[§5] §5 (experiments): the manuscript must report concrete baselines, datasets, effect sizes, statistical controls, and ablation results; without these the abstract claim of 'consistent improvements' cannot be verified and may reflect post-hoc metric choices or insufficient controls.

minor comments (2)

Notation for the group homomorphism and latent-space mappings should be introduced with explicit equations at first use to improve clarity.
Ensure every acronym (GAC, GAR, SE(2)) is expanded on first occurrence in the body text.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the detailed and constructive comments. We address each major comment point by point below, indicating where revisions will be incorporated.

read point-by-point responses

Referee: §3 (synthesized supervision): the generation process for the supervision signals must be specified in detail; if supervision is produced from the base model's own predicted transitions or by applying assumed group operations directly in latent space without external grounding in real action trajectories, the regularization is self-referential and the claim of an independent principled criterion is undermined.

Authors: We agree that the generation process for synthesized supervision must be specified with full precision. The revised manuscript will include an expanded description in §3 detailing that supervision signals are generated by applying the known group operations (identity, inverse, composition) to ground-truth action labels drawn from the original dataset trajectories, rather than from model predictions or direct latent-space manipulations. This ensures external grounding in real action data and preserves the independence of the regularization criterion. revision: yes
Referee: §4 (GAC/GAR definitions): the metrics directly quantify the same identity, inverse, and composition properties that are enforced by the loss terms; this makes reported gains in GAC/GAR follow by construction from the added regularization rather than from improved alignment with embodied dynamics, weakening the central claim that the approach supplies an independent evaluation criterion.

Authors: The metrics and loss terms target the same group properties by design, as both operationalize the formalization in §2. However, GAC and GAR are evaluated on held-out test trajectories and multi-step rollouts, providing a measure of generalization to unseen data, whereas the regularization is applied only during training. We will revise §4 to explicitly distinguish these roles and clarify that the metrics serve as an external diagnostic of structural correctness beyond training objectives, thereby supporting their use as an independent evaluation criterion. revision: partial
Referee: §5 (experiments): the manuscript must report concrete baselines, datasets, effect sizes, statistical controls, and ablation results; without these the abstract claim of 'consistent improvements' cannot be verified and may reflect post-hoc metric choices or insufficient controls.

Authors: We acknowledge that the current experimental section requires additional concrete details for full verifiability. The revised manuscript will expand §5 to report specific model baselines, exact dataset names and splits, quantitative effect sizes with standard deviations across multiple runs, statistical significance tests, and further ablation studies isolating the contribution of each consistency term. revision: yes

Circularity Check

1 steps flagged

GAC/GAR metrics improve by construction from regularization enforcing the same group consistencies they measure

specific steps

fitted input called prediction [Abstract]
"we propose a unified approach that enforces identity, inverse, and composition consistency via latent-space regularization with synthesized supervision, avoiding additional data collection. We further introduce two metrics: Group-Action Consistency (GAC) and Group-Action Robustness (GAR), to evaluate structural correctness and rollout stability. Extensive experimental results show that our method consistently improves both GAC and GAR in state-of-the-art video world models without degrading perceptual quality."

GAC and GAR are defined to measure precisely the identity/inverse/composition consistencies that the regularization loss directly optimizes; therefore reported metric gains follow by construction from applying the synthesized-supervision objective rather than from external verification that the latent dynamics obey an independent group structure.

full rationale

The paper defines action faithfulness as realizing a group action (identity, inverse, composition), operationalizes it by enforcing those exact properties via latent regularization with synthesized supervision, and then reports that the method 'consistently improves both GAC and GAR'. Because GAC and GAR are introduced specifically 'to evaluate structural correctness' of those same consistencies, the claimed gains on the metrics reduce to a direct consequence of the loss terms rather than independent evidence. The perceptual-quality claim is not circular, but the central structural improvement is. No self-citation chains or other patterns appear in the given text.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the domain assumption that action dynamics admit a group structure and that this structure can be imposed via latent regularization without new data; no free parameters or invented entities are mentioned in the abstract.

axioms (1)

domain assumption Actions in embodied settings follow a group structure (e.g., SE(2) for navigation)
Invoked in the abstract as the basis for formalizing action-conditioned world modeling as realizing a group action.

pith-pipeline@v0.9.1-grok · 5697 in / 1312 out tokens · 31629 ms · 2026-06-30T13:15:03.868084+00:00 · methodology

discussion (0)

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

ATM: Action-Consistency Transfer Matrix for Diagnosing and Improving Latent World Models
cs.CV 2026-06 unverdicted novelty 6.0

ATM is a post-hoc probe-based transfer matrix that diagnoses action consistency in latent world models and serves as a training signal via AITS, enabling fast reliable ranking with claimed 100x speedup over CEM planne...

Reference graph

Works this paper leans on

88 extracted references · 45 canonical work pages · cited by 1 Pith paper · 18 internal anchors

[1]

Llm-planner: Few-shot grounded planning for embodied agents with large language models

Chan Hee Song, Jiaman Wu, Clayton Washington, Brian M Sadler, Wei-Lun Chao, and Yu Su. Llm-planner: Few-shot grounded planning for embodied agents with large language models. In Proceedings of the IEEE/CVF international conference on computer vision, pages 2998–3009, 2023

2023
[2]

Grhp: Graph- fused hierarchical planning for embodied long-horizon robotic task.Engineering Applications of Artificial Intelligence, 165:113413, 2026

Xiaodong Li, Guohui Tian, Yongcheng Cui, Xuyang Shao, and Zhiwei Wang. Grhp: Graph- fused hierarchical planning for embodied long-horizon robotic task.Engineering Applications of Artificial Intelligence, 165:113413, 2026

2026
[3]

Embodied Task Planning via Graph-Informed Action Generation with Large Language Models

Xiang Li, Ning Yan, and Masood Mortazavi. Embodied task planning via graph-informed action generation with large language model.arXiv preprint arXiv:2601.21841, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[4]

Embodied agent interface: Benchmarking llms for embodied decision making.Advances in Neural Information Processing Systems, 37:100428–100534, 2024

Manling Li, Shiyu Zhao, Qineng Wang, Kangrui Wang, Yu Zhou, Sanjana Srivastava, Cem Gokmen, Tony Lee, Li E Li, Ruohan Zhang, et al. Embodied agent interface: Benchmarking llms for embodied decision making.Advances in Neural Information Processing Systems, 37:100428–100534, 2024

2024
[5]

Vlabench: A large-scale benchmark for language- conditioned robotics manipulation with long-horizon reasoning tasks

Shiduo Zhang, Zhe Xu, Peiju Liu, Xiaopeng Yu, Yuan Li, Qinghui Gao, Zhaoye Fei, Zhangyue Yin, Zuxuan Wu, Yu-Gang Jiang, et al. Vlabench: A large-scale benchmark for language- conditioned robotics manipulation with long-horizon reasoning tasks. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 11142–11152, 2025

2025
[6]

Large model empowered embodied ai: A survey on decision-making and embodied learning

Wenlong Liang, Rui Zhou, Yang Ma, Bing Zhang, Songlin Li, Yijia Liao, and Ping Kuang. Large model empowered embodied ai: A survey on decision-making and embodied learning. arXiv preprint arXiv:2508.10399, 2025

work page arXiv 2025
[7]

Meia: Multimodal embodied perception and interaction in unknown environments.arXiv preprint arXiv:2402.00290, 2024

Yang Liu, Xinshuai Song, Kaixuan Jiang, Weixing Chen, Jingzhou Luo, Guanbin Li, and Liang Lin. Meia: Multimodal embodied perception and interaction in unknown environments.arXiv preprint arXiv:2402.00290, 2024

work page arXiv 2024
[8]

Aligning cyber space with physical world: A comprehensive survey on embodied ai

Yang Liu, Weixing Chen, Yongjie Bai, Xiaodan Liang, Guanbin Li, Wen Gao, and Liang Lin. Aligning cyber space with physical world: A comprehensive survey on embodied ai. IEEE/ASME Transactions on Mechatronics, 2025

2025
[9]

Beyond the destination: A novel benchmark for exploration-aware embodied question answering

Kaixuan Jiang, Yang Liu, Weixing Chen, Jingzhou Luo, Ziliang Chen, Ling Pan, Guanbin Li, and Liang Lin. Beyond the destination: A novel benchmark for exploration-aware embodied question answering. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 9091–9101, 2025

2025
[10]

Video diffusion models.Advances in neural information processing systems, 35:8633–8646, 2022

Jonathan Ho, Tim Salimans, Alexey Gritsenko, William Chan, Mohammad Norouzi, and David J Fleet. Video diffusion models.Advances in neural information processing systems, 35:8633–8646, 2022

2022
[11]

CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer

Zhuoyi Yang, Jiayan Teng, Wendi Zheng, Ming Ding, Shiyu Huang, Jiazheng Xu, Yuanming Yang, Wenyi Hong, Xiaohan Zhang, Guanyu Feng, et al. Cogvideox: Text-to-video diffusion models with an expert transformer.arXiv preprint arXiv:2408.06072, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[12]

Wan: Open and Advanced Large-Scale Video Generative Models

Team Wan, Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, Chen-Wei Xie, Di Chen, Feiwu Yu, Haiming Zhao, Jianxiao Yang, et al. Wan: Open and advanced large-scale video generative models.arXiv preprint arXiv:2503.20314, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[13]

Navigation world models

Amir Bar, Gaoyue Zhou, Danny Tran, Trevor Darrell, and Yann LeCun. Navigation world models. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 15791–15801, 2025

2025
[14]

Rapid exploration for open-world navigation with latent goal models.arXiv preprint arXiv:2104.05859, 2021

Dhruv Shah, Benjamin Eysenbach, Gregory Kahn, Nicholas Rhinehart, and Sergey Levine. Rapid exploration for open-world navigation with latent goal models.arXiv preprint arXiv:2104.05859, 2021

work page arXiv 2021
[15]

Socially compliant navigation dataset (scand): A large-scale dataset of demonstrations for social navigation.IEEE Robotics and Automation Letters, 7(4):11807–11814, 2022

Haresh Karnan, Anirudh Nair, Xuesu Xiao, Garrett Warnell, Sören Pirk, Alexander Toshev, Justin Hart, Joydeep Biswas, and Peter Stone. Socially compliant navigation dataset (scand): A large-scale dataset of demonstrations for social navigation.IEEE Robotics and Automation Letters, 7(4):11807–11814, 2022. 10

2022
[16]

Sacson: Scalable autonomous control for social navigation.IEEE Robotics and Automation Letters, 9(1):49–56, 2023

Noriaki Hirose, Dhruv Shah, Ajay Sridhar, and Sergey Levine. Sacson: Scalable autonomous control for social navigation.IEEE Robotics and Automation Letters, 9(1):49–56, 2023

2023
[17]

Diffusion for world modeling: Visual details matter in atari.Advances in Neural Information Processing Systems, 37:58757–58791, 2024

Eloi Alonso, Adam Jelley, Vincent Micheli, Anssi Kanervisto, Amos Storkey, Tim Pearce, and François Fleuret. Diffusion for world modeling: Visual details matter in atari.Advances in Neural Information Processing Systems, 37:58757–58791, 2024

2024
[18]

Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets

Andreas Blattmann, Tim Dockhorn, Sumith Kulal, Daniel Mendelevitch, Maciej Kilian, Do- minik Lorenz, Yam Levi, Zion English, Vikram V oleti, Adam Letts, et al. Stable video diffusion: Scaling latent video diffusion models to large datasets.arXiv preprint arXiv:2311.15127, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[19]

Decoupled Weight Decay Regularization

Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization.arXiv preprint arXiv:1711.05101, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017
[20]

Droid-slam: Deep visual slam for monocular, stereo, and rgb-d cameras.Advances in neural information processing systems, 34:16558–16569, 2021

Zachary Teed and Jia Deng. Droid-slam: Deep visual slam for monocular, stereo, and rgb-d cameras.Advances in neural information processing systems, 34:16558–16569, 2021

2021
[21]

The unrea- sonable effectiveness of deep features as a perceptual metric

Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang. The unrea- sonable effectiveness of deep features as a perceptual metric. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 586–595, 2018

2018
[22]

Dreamsim: Learning new dimensions of human visual similarity using synthetic data.Advances in Neural Information Processing Systems, 36:50742–50768, 2023

Stephanie Fu, Netanel Tamir, Shobhita Sundaram, Lucy Chai, Richard Zhang, Tali Dekel, and Phillip Isola. Dreamsim: Learning new dimensions of human visual similarity using synthetic data.Advances in Neural Information Processing Systems, 36:50742–50768, 2023

2023
[23]

Nomad: Goal masked diffusion policies for navigation and exploration

Ajay Sridhar, Dhruv Shah, Catherine Glossop, and Sergey Levine. Nomad: Goal masked diffusion policies for navigation and exploration. In2024 IEEE International Conference on Robotics and Automation (ICRA), pages 63–70. IEEE, 2024

2024
[24]

Goal-conditioned reinforcement learning for data-driven maritime navigation

Vaishnav Vaidheeswaran, Dilith Jayakody, Samruddhi Mulay, Anand Lo, Md Mahbub Alam, and Gabriel Spadon. Goal-conditioned reinforcement learning for data-driven maritime navigation. arXiv preprint arXiv:2509.01838, 2025

work page arXiv 2025
[25]

An efficient and multi-modal navigation system with one-step world model.arXiv preprint arXiv:2601.12277, 2026

Wangtian Shen, Ziyang Meng, Jinming Ma, Mingliang Zhou, and Diyun Xiang. An efficient and multi-modal navigation system with one-step world model.arXiv preprint arXiv:2601.12277, 2026

work page arXiv 2026
[26]

Vision-and-language navi- gation: A survey of tasks, methods, and future directions

Jing Gu, Eliana Stefani, Qi Wu, Jesse Thomason, and Xin Wang. Vision-and-language navi- gation: A survey of tasks, methods, and future directions. InProceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 7606–7623, 2022

2022
[27]

Navgpt: Explicit reasoning in vision-and-language navigation with large language models

Gengze Zhou, Yicong Hong, and Qi Wu. Navgpt: Explicit reasoning in vision-and-language navigation with large language models. InProceedings of the AAAI Conference on Artificial Intelligence, volume 38, pages 7641–7649, 2024

2024
[28]

Profocus: Proactive perception and focused reasoning in vision-and-language navigation.arXiv preprint arXiv:2603.05530, 2026

Wei Xue, Mingcheng Li, Xuecheng Wu, Jingqun Tang, Dingkang Yang, and Lihua Zhang. Profocus: Proactive perception and focused reasoning in vision-and-language navigation.arXiv preprint arXiv:2603.05530, 2026

work page arXiv 2026
[29]

Effonav: An effective foundation- model-based visual navigation approach in challenging environment.IEEE Robotics and Automation Letters, 2025

Wangtian Shen, Pengfei Gu, Haijian Qin, and Ziyang Meng. Effonav: An effective foundation- model-based visual navigation approach in challenging environment.IEEE Robotics and Automation Letters, 2025

2025
[30]

Foundation-model-based action selection for behavior trees in navigation

Michele Moriconi, Stefan Laible, and Carmine Recchiuto. Foundation-model-based action selection for behavior trees in navigation. In2025 European Conference on Mobile Robots (ECMR), pages 1–7. IEEE, 2025

2025
[31]

Foundation model driven robotics: A compre- hensive review.arXiv preprint arXiv:2507.10087, 2025

Muhammad Tayyab Khan and Ammar Waheed. Foundation model driven robotics: A compre- hensive review.arXiv preprint arXiv:2507.10087, 2025

work page arXiv 2025
[32]

Gnm: A general navigation model to drive any robot

Dhruv Shah, Ajay Sridhar, Arjun Bhorkar, Noriaki Hirose, and Sergey Levine. Gnm: A general navigation model to drive any robot. In2023 IEEE International Conference on Robotics and Automation (ICRA), pages 7226–7233. IEEE, 2023. 11

2023
[33]

Vint: A foundation model for visual navigation.arXiv preprint arXiv:2306.14846, 2023

Dhruv Shah, Ajay Sridhar, Nitish Dashora, Kyle Stachowicz, Kevin Black, Noriaki Hi- rose, and Sergey Levine. Vint: A foundation model for visual navigation.arXiv preprint arXiv:2306.14846, 2023

work page arXiv 2023
[34]

Towards long-horizon vision-language navigation: Platform, benchmark and method

Xinshuai Song, Weixing Chen, Yang Liu, Vincent Chan, Guanbin Li, and Liang Lin. Towards long-horizon vision-language navigation: Platform, benchmark and method. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025

2025
[35]

Embodied navigation foundation model.arXiv preprint arXiv:2509.12129, 2025

Jiazhao Zhang, Anqi Li, Yunpeng Qi, Minghan Li, Jiahang Liu, Shaoan Wang, Haoran Liu, Gengze Zhou, Yuze Wu, Xingxing Li, et al. Embodied navigation foundation model.arXiv preprint arXiv:2509.12129, 2025

work page arXiv 2025
[36]

Vln-r1: Vision-language navigation via reinforcement fine-tuning.arXiv preprint arXiv:2506.17221, 2025

Zhangyang Qi, Zhixiong Zhang, Yizhou Yu, Jiaqi Wang, and Hengshuang Zhao. Vln-r1: Vision-language navigation via reinforcement fine-tuning.arXiv preprint arXiv:2506.17221, 2025

work page arXiv 2025
[37]

ImagineNav++: Prompting Vision-Language Models as Embodied Navigator through Scene Imagination

Teng Wang, Xinxin Zhao, Wenzhe Cai, and Changyin Sun. Imaginenav++: Prompting vision-language models as embodied navigator through scene imagination.arXiv preprint arXiv:2512.17435, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[38]

Vld: Visual language goal distance for reinforce- ment learning navigation.arXiv preprint arXiv:2512.07976, 2025

Lazar Milikic, Manthan Patel, and Jonas Frey. Vld: Visual language goal distance for reinforce- ment learning navigation.arXiv preprint arXiv:2512.07976, 2025

work page arXiv 2025
[39]

Can Vision Foundation Models Navigate? Zero-Shot Real-World Evaluation and Lessons Learned

Maeva Guerrier, Karthik Soma, Jana Pavlasek, and Giovanni Beltrame. Can vision founda- tion models navigate? zero-shot real-world evaluation and lessons learned.arXiv preprint arXiv:2603.25937, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[40]

Navforesee: A unified vision-language world model for hierarchi- cal planning and dual-horizon navigation prediction

Fei Liu, Shichao Xie, Minghua Luo, Zedong Chu, Junjun Hu, Xiaolong Wu, and Mu Xu. Navforesee: A unified vision-language world model for hierarchical planning and dual-horizon navigation prediction.arXiv preprint arXiv:2512.01550, 2025

work page arXiv 2025
[41]

Dream to Control: Learning Behaviors by Latent Imagination

Danijar Hafner, Timothy Lillicrap, Jimmy Ba, and Mohammad Norouzi. Dream to control: Learning behaviors by latent imagination.arXiv preprint arXiv:1912.01603, 2019

work page internal anchor Pith review Pith/arXiv arXiv 1912
[42]

Mastering Diverse Domains through World Models

Danijar Hafner, Jurgis Pasukonis, Jimmy Ba, and Timothy Lillicrap. Mastering diverse domains through world models.arXiv preprint arXiv:2301.04104, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[43]

Dreaming: Model-based reinforcement learning by latent imagination without reconstruction

Masashi Okada and Tadahiro Taniguchi. Dreaming: Model-based reinforcement learning by latent imagination without reconstruction. In2021 ieee international conference on robotics and automation (icra), pages 4209–4215. IEEE, 2021

2021
[44]

Model-based reinforcement learning via imagination with derived memory.Advances in Neural Information Processing Systems, 34:9493–9505, 2021

Yao Mu, Yuzheng Zhuang, Bin Wang, Guangxiang Zhu, Wulong Liu, Jianyu Chen, Ping Luo, Shengbo Li, Chongjie Zhang, and Jianye Hao. Model-based reinforcement learning via imagination with derived memory.Advances in Neural Information Processing Systems, 34:9493–9505, 2021

2021
[45]

Dreamerad: Efficient reinforcement learning via latent world model for autonomous driving.arXiv preprint arXiv:2603.24587, 2026

Pengxuan Yang, Yupeng Zheng, Deheng Qian, Zebin Xing, Qichao Zhang, Linbo Wang, Yichen Zhang, Shaoyu Guo, Zhongpu Xia, Qiang Chen, et al. Dreamerad: Efficient reinforcement learning via latent world model for autonomous driving.arXiv preprint arXiv:2603.24587, 2026

work page arXiv 2026
[46]

4d latent world model for robot planning.OpenReview preprint, 2026

Zhiyi Li, Peilin Wu, Xiaoshen Han, Ruojin Cai, and Yilun Du. 4d latent world model for robot planning.OpenReview preprint, 2026

2026
[47]

Denoising diffusion probabilistic models.Advances in neural information processing systems, 33:6840–6851, 2020

Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models.Advances in neural information processing systems, 33:6840–6851, 2020

2020
[48]

High- resolution image synthesis with latent diffusion models

Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High- resolution image synthesis with latent diffusion models. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10684–10695, 2022

2022
[49]

Video diffusion models.Advances in neural information processing systems, 35:8633–8646, 2022

Jonathan Ho, Tim Salimans, Alexey Gritsenko, William Chan, Mohammad Norouzi, and David J Fleet. Video diffusion models.Advances in neural information processing systems, 35:8633–8646, 2022. 12

2022
[50]

Imagen Video: High Definition Video Generation with Diffusion Models

Jonathan Ho, William Chan, Chitwan Saharia, Jay Whang, Ruiqi Gao, Alexey Gritsenko, Diederik P Kingma, Ben Poole, Mohammad Norouzi, David J Fleet, et al. Imagen video: High definition video generation with diffusion models.arXiv preprint arXiv:2210.02303, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[51]

Align your latents: High-resolution video synthesis with latent diffusion models

Andreas Blattmann, Robin Rombach, Huan Ling, Tim Dockhorn, Seung Wook Kim, Sanja Fidler, and Karsten Kreis. Align your latents: High-resolution video synthesis with latent diffusion models. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 22563–22575, 2023

2023
[52]

AnimateDiff: Animate Your Personalized Text-to-Image Diffusion Models without Specific Tuning

Yuwei Guo, Ceyuan Yang, Anyi Rao, Zhengyang Liang, Yaohui Wang, Yu Qiao, Maneesh Agrawala, Dahua Lin, and Bo Dai. Animatediff: Animate your personalized text-to-image diffusion models without specific tuning.arXiv preprint arXiv:2307.04725, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[53]

Open-Sora: Democratizing Efficient Video Production for All

Zangwei Zheng, Xiangyu Peng, Tianji Yang, Chenhui Shen, Shenggui Li, Hongxin Liu, Yukun Zhou, Tianyi Li, and Yang You. Open-sora: Democratizing efficient video production for all. arXiv preprint arXiv:2412.20404, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[54]

Vid2world: Crafting video diffusion models to interactive world models,

Siqiao Huang, Jialong Wu, Qixing Zhou, Shangchen Miao, and Mingsheng Long. Vid2world: Crafting video diffusion models to interactive world models.arXiv preprint arXiv:2505.14357, 2025

work page arXiv 2025
[55]

World Action Models are Zero-shot Policies

Seonghyeon Ye, Yunhao Ge, Kaiyuan Zheng, Shenyuan Gao, Sihyun Yu, George Kurian, Suneel Indupuru, You Liang Tan, Chuning Zhu, Jiannan Xiang, et al. World action models are zero-shot policies.arXiv preprint arXiv:2602.15922, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[56]

Aether: Geometric-aware unified world modeling

Haoyi Zhu, Yifan Wang, Jianjun Zhou, Wenzheng Chang, Yang Zhou, Zizun Li, Junyi Chen, Chunhua Shen, Jiangmiao Pang, and Tong He. Aether: Geometric-aware unified world modeling. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 8535– 8546, 2025

2025
[57]

Language-conditioned world modeling for visual navigation.arXiv preprint arXiv:2603.26741, 2026

Yifei Dong, Fengyi Wu, Yilong Dai, Lingdong Kong, Guangyu Chen, Xu Zhu, Qiyu Hu, Tianyu Wang, Johnalbert Garnica, Feng Liu, et al. Language-conditioned world modeling for visual navigation.arXiv preprint arXiv:2603.26741, 2026

work page arXiv 2026
[58]

Learning Interactive Real-World Simulators

Mengjiao Yang, Yilun Du, Kamyar Ghasemipour, Jonathan Tompson, Dale Schuurmans, and Pieter Abbeel. Learning interactive real-world simulators.arXiv preprint arXiv:2310.06114, 1(2):6, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[59]

Unified world models: Memory-augmented planning and foresight for visual navigation.arXiv preprint arXiv:2510.08713, 2025

Yifei Dong, Fengyi Wu, Guangyu Chen, Zhi-Qi Cheng, Qiyu Hu, Yuxuan Zhou, Jingdong Sun, Jun-Yan He, Qi Dai, and Alexander G Hauptmann. Unified world models: Memory-augmented planning and foresight for visual navigation.arXiv preprint arXiv:2510.08713, 2025

work page arXiv 2025
[60]

Mowm: Mixture-of-world-models for embodied planning via latent-to-pixel feature modulation.arXiv preprint arXiv:2509.21797, 2025

Yangcheng Yu, Xin Jin, Yu Shang, Xin Zhang, Haisheng Su, Wei Wu, and Yong Li. Mowm: Mixture-of-world-models for embodied planning via latent-to-pixel feature modulation.arXiv preprint arXiv:2509.21797, 2025

work page arXiv 2025
[61]

Gigaworld-policy: An efficient action-centered world–action model.arXiv preprint arXiv:2603.17240, 2026

Angen Ye, Boyuan Wang, Chaojun Ni, Guan Huang, Guosheng Zhao, Hao Li, Hengtao Li, Jie Li, Jindi Lv, Jingyu Liu, et al. Gigaworld-policy: An efficient action-centered world–action model.arXiv preprint arXiv:2603.17240, 2026

work page arXiv 2026
[62]

Towards high-consistency embodied world model with multi-view trajectory videos.arXiv preprint arXiv:2511.12882, 2025

Taiyi Su, Jian Zhu, Yaxuan Li, Chong Ma, Jianjun Zhang, Zitai Huang, Hanli Wang, and Yi Xu. Towards high-consistency embodied world model with multi-view trajectory videos.arXiv preprint arXiv:2511.12882, 2025

work page arXiv 2025
[63]

Worldsimbench: Towards video generation models as world simulators.arXiv preprint arXiv:2410.18072, 2024

Yiran Qin, Zhelun Shi, Jiwen Yu, Xijun Wang, Enshen Zhou, Lijun Li, Zhenfei Yin, Xihui Liu, Lu Sheng, Jing Shao, et al. Worldsimbench: Towards video generation models as world simulators.arXiv preprint arXiv:2410.18072, 2024

work page arXiv 2024
[64]

Wow, wo, val! a comprehensive embodied world model evaluation turing test.arXiv preprint arXiv:2601.04137, 2026

Chun-Kai Fan, Xiaowei Chi, Xiaozhu Ju, Hao Li, Yong Bao, Yu-Kai Wang, Lizhang Chen, Zhiyuan Jiang, Kuangzhi Ge, Ying Li, et al. Wow, wo, val! a comprehensive embodied world model evaluation turing test.arXiv preprint arXiv:2601.04137, 2026. 13

work page arXiv 2026
[65]

Worldarena: A unified benchmark for evaluating perception and functional utility of embodied world models.arXiv preprint arXiv:2602.08971, 2026

Yu Shang, Zhuohang Li, Yiding Ma, Weikang Su, Xin Jin, Ziyou Wang, Lei Jin, Xin Zhang, Yinzhou Tang, Haisheng Su, et al. Worldarena: A unified benchmark for evaluating perception and functional utility of embodied world models.arXiv preprint arXiv:2602.08971, 2026

work page arXiv 2026
[66]

WorldLens: Full-Spectrum Evaluations of Driving World Models in Real World

Ao Liang, Lingdong Kong, Tianyi Yan, Hongsi Liu, Wesley Yang, Ziqi Huang, Wei Yin, Jialong Zuo, Yixuan Hu, Dekai Zhu, et al. Worldlens: Full-spectrum evaluations of driving world models in real world.arXiv preprint arXiv:2512.10958, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[67]

Mind: Benchmarking memory consistency and action control in world models.arXiv preprint arXiv:2602.08025, 2026

Yixuan Ye, Xuanyu Lu, Yuxin Jiang, Yuchao Gu, Rui Zhao, Qiwei Liang, Jiachun Pan, Fengda Zhang, Weijia Wu, and Alex Jinpeng Wang. Mind: Benchmarking memory consistency and action control in world models.arXiv preprint arXiv:2602.08025, 2026

work page arXiv 2026
[68]

LoopNav: Benchmarking Spatial Consistency in World Models

Kewei Lian, Shaofei Cai, Yilun Du, and Yitao Liang. Toward memory-aided world models: Benchmarking via spatial consistency.arXiv preprint arXiv:2505.22976, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[69]

Toward Consistent World Models with Multi-Token Prediction and Latent Semantic Enhancement

Qimin Zhong, Hao Liao, Haiming Qin, Mingyang Zhou, Rui Mao, Wei Chen, and Naipeng Chao. Toward consistent world models with multi-token prediction and latent semantic enhancement. arXiv preprint arXiv:2604.06155, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[70]

Wildworld: A large-scale dataset for dynamic world modeling with actions and explicit state toward generative arpg.arXiv preprint arXiv:2603.23497, 2026

Zhen Li, Zian Meng, Shuwei Shi, Wenshuo Peng, Yuwei Wu, Bo Zheng, Chuanhao Li, and Kaipeng Zhang. Wildworld: A large-scale dataset for dynamic world modeling with actions and explicit state toward generative arpg.arXiv preprint arXiv:2603.23497, 2026

work page arXiv 2026
[71]

Mwm: Mobile world models for action-conditioned consistent prediction.arXiv preprint arXiv:2603.07799, 2026

Han Yan, Zishang Xiang, Zeyu Zhang, and Hao Tang. Mwm: Mobile world models for action-conditioned consistent prediction.arXiv preprint arXiv:2603.07799, 2026

work page arXiv 2026
[72]

Generative modeling of molecular dynamics trajectories.Advances in Neural Information Processing Systems, 37:40534– 40564, 2024

Bowen Jing, Hannes Stärk, Tommi Jaakkola, and Bonnie Berger. Generative modeling of molecular dynamics trajectories.Advances in Neural Information Processing Systems, 37:40534– 40564, 2024

2024
[73]

Scalable emulation of protein equilibrium ensembles with generative deep learning.Science, 389(6761):eadv9817, 2025

Sarah Lewis, Tim Hempel, José Jiménez-Luna, Michael Gastegger, Yu Xie, Andrew YK Foong, Victor García Satorras, Osama Abdin, Bastiaan S Veeling, Iryna Zaporozhets, et al. Scalable emulation of protein equilibrium ensembles with generative deep learning.Science, 389(6761):eadv9817, 2025

2025
[74]

Beyond ensembles: Simulating all-atom protein dynamics in a learned latent space.arXiv preprint arXiv:2509.02196, 2025

Aditya Sengar, Jiying Zhang, Pierre Vandergheynst, and Patrick Barth. Beyond ensembles: Simulating all-atom protein dynamics in a learned latent space.arXiv preprint arXiv:2509.02196, 2025

work page arXiv 2025
[75]

Conditional diffusion with locality-aware modal alignment for generating diverse protein conformational ensembles.Nature Machine Intelligence, pages 1–20, 2026

Baoli Wang, Chenglin Wang, Jingyang Chen, Danlin Liu, Changzhi Sun, Jie Zhang, Kai Zhang, and Honglin Li. Conditional diffusion with locality-aware modal alignment for generating diverse protein conformational ensembles.Nature Machine Intelligence, pages 1–20, 2026

2026
[76]

Pathdiffusion: modeling protein folding pathway using evolution-guided diffusion.bioRxiv, pages 2026–01, 2026

Kailong Zhao, Chenxiao Xiang, Bin Cheng, Yunyun Shen, Wenkai Wang, Shuyun Chen, Baoquan Su, Guijun Zhang, Zhenling Peng, and Jianyi Yang. Pathdiffusion: modeling protein folding pathway using evolution-guided diffusion.bioRxiv, pages 2026–01, 2026

2026
[77]

Tensor field networks: Rotation- and translation-equivariant neural networks for 3D point clouds

Nathaniel Thomas, Tess Smidt, Steven Kearnes, Lusann Yang, Li Li, Kai Kohlhoff, and Patrick Riley. Tensor field networks: Rotation-and translation-equivariant neural networks for 3d point clouds.arXiv preprint arXiv:1802.08219, 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018
[78]

E (n) equivariant graph neural networks

Vıctor Garcia Satorras, Emiel Hoogeboom, and Max Welling. E (n) equivariant graph neural networks. InInternational conference on machine learning, pages 9323–9332. PMLR, 2021

2021
[79]

Equivariant diffusion for molecule generation in 3d

Emiel Hoogeboom, Vıctor Garcia Satorras, Clément Vignac, and Max Welling. Equivariant diffusion for molecule generation in 3d. InInternational conference on machine learning, pages 8867–8887. PMLR, 2022

2022
[80]

Geodiff: A geo- metric diffusion model for molecular conformation generation.arXiv preprint arXiv:2203.02923, 2022

Minkai Xu, Lantao Yu, Yang Song, Chence Shi, Stefano Ermon, and Jian Tang. Geodiff: A geo- metric diffusion model for molecular conformation generation.arXiv preprint arXiv:2203.02923, 2022. 14

work page arXiv 2022

Showing first 80 references.

[1] [1]

Llm-planner: Few-shot grounded planning for embodied agents with large language models

Chan Hee Song, Jiaman Wu, Clayton Washington, Brian M Sadler, Wei-Lun Chao, and Yu Su. Llm-planner: Few-shot grounded planning for embodied agents with large language models. In Proceedings of the IEEE/CVF international conference on computer vision, pages 2998–3009, 2023

2023

[2] [2]

Grhp: Graph- fused hierarchical planning for embodied long-horizon robotic task.Engineering Applications of Artificial Intelligence, 165:113413, 2026

Xiaodong Li, Guohui Tian, Yongcheng Cui, Xuyang Shao, and Zhiwei Wang. Grhp: Graph- fused hierarchical planning for embodied long-horizon robotic task.Engineering Applications of Artificial Intelligence, 165:113413, 2026

2026

[3] [3]

Embodied Task Planning via Graph-Informed Action Generation with Large Language Models

Xiang Li, Ning Yan, and Masood Mortazavi. Embodied task planning via graph-informed action generation with large language model.arXiv preprint arXiv:2601.21841, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026

[4] [4]

Embodied agent interface: Benchmarking llms for embodied decision making.Advances in Neural Information Processing Systems, 37:100428–100534, 2024

Manling Li, Shiyu Zhao, Qineng Wang, Kangrui Wang, Yu Zhou, Sanjana Srivastava, Cem Gokmen, Tony Lee, Li E Li, Ruohan Zhang, et al. Embodied agent interface: Benchmarking llms for embodied decision making.Advances in Neural Information Processing Systems, 37:100428–100534, 2024

2024

[5] [5]

Vlabench: A large-scale benchmark for language- conditioned robotics manipulation with long-horizon reasoning tasks

Shiduo Zhang, Zhe Xu, Peiju Liu, Xiaopeng Yu, Yuan Li, Qinghui Gao, Zhaoye Fei, Zhangyue Yin, Zuxuan Wu, Yu-Gang Jiang, et al. Vlabench: A large-scale benchmark for language- conditioned robotics manipulation with long-horizon reasoning tasks. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 11142–11152, 2025

2025

[6] [6]

Large model empowered embodied ai: A survey on decision-making and embodied learning

Wenlong Liang, Rui Zhou, Yang Ma, Bing Zhang, Songlin Li, Yijia Liao, and Ping Kuang. Large model empowered embodied ai: A survey on decision-making and embodied learning. arXiv preprint arXiv:2508.10399, 2025

work page arXiv 2025

[7] [7]

Meia: Multimodal embodied perception and interaction in unknown environments.arXiv preprint arXiv:2402.00290, 2024

Yang Liu, Xinshuai Song, Kaixuan Jiang, Weixing Chen, Jingzhou Luo, Guanbin Li, and Liang Lin. Meia: Multimodal embodied perception and interaction in unknown environments.arXiv preprint arXiv:2402.00290, 2024

work page arXiv 2024

[8] [8]

Aligning cyber space with physical world: A comprehensive survey on embodied ai

Yang Liu, Weixing Chen, Yongjie Bai, Xiaodan Liang, Guanbin Li, Wen Gao, and Liang Lin. Aligning cyber space with physical world: A comprehensive survey on embodied ai. IEEE/ASME Transactions on Mechatronics, 2025

2025

[9] [9]

Beyond the destination: A novel benchmark for exploration-aware embodied question answering

Kaixuan Jiang, Yang Liu, Weixing Chen, Jingzhou Luo, Ziliang Chen, Ling Pan, Guanbin Li, and Liang Lin. Beyond the destination: A novel benchmark for exploration-aware embodied question answering. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 9091–9101, 2025

2025

[10] [10]

Video diffusion models.Advances in neural information processing systems, 35:8633–8646, 2022

Jonathan Ho, Tim Salimans, Alexey Gritsenko, William Chan, Mohammad Norouzi, and David J Fleet. Video diffusion models.Advances in neural information processing systems, 35:8633–8646, 2022

2022

[11] [11]

CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer

Zhuoyi Yang, Jiayan Teng, Wendi Zheng, Ming Ding, Shiyu Huang, Jiazheng Xu, Yuanming Yang, Wenyi Hong, Xiaohan Zhang, Guanyu Feng, et al. Cogvideox: Text-to-video diffusion models with an expert transformer.arXiv preprint arXiv:2408.06072, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[12] [12]

Wan: Open and Advanced Large-Scale Video Generative Models

Team Wan, Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, Chen-Wei Xie, Di Chen, Feiwu Yu, Haiming Zhao, Jianxiao Yang, et al. Wan: Open and advanced large-scale video generative models.arXiv preprint arXiv:2503.20314, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[13] [13]

Navigation world models

Amir Bar, Gaoyue Zhou, Danny Tran, Trevor Darrell, and Yann LeCun. Navigation world models. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 15791–15801, 2025

2025

[14] [14]

Rapid exploration for open-world navigation with latent goal models.arXiv preprint arXiv:2104.05859, 2021

Dhruv Shah, Benjamin Eysenbach, Gregory Kahn, Nicholas Rhinehart, and Sergey Levine. Rapid exploration for open-world navigation with latent goal models.arXiv preprint arXiv:2104.05859, 2021

work page arXiv 2021

[15] [15]

Socially compliant navigation dataset (scand): A large-scale dataset of demonstrations for social navigation.IEEE Robotics and Automation Letters, 7(4):11807–11814, 2022

Haresh Karnan, Anirudh Nair, Xuesu Xiao, Garrett Warnell, Sören Pirk, Alexander Toshev, Justin Hart, Joydeep Biswas, and Peter Stone. Socially compliant navigation dataset (scand): A large-scale dataset of demonstrations for social navigation.IEEE Robotics and Automation Letters, 7(4):11807–11814, 2022. 10

2022

[16] [16]

Sacson: Scalable autonomous control for social navigation.IEEE Robotics and Automation Letters, 9(1):49–56, 2023

Noriaki Hirose, Dhruv Shah, Ajay Sridhar, and Sergey Levine. Sacson: Scalable autonomous control for social navigation.IEEE Robotics and Automation Letters, 9(1):49–56, 2023

2023

[17] [17]

Diffusion for world modeling: Visual details matter in atari.Advances in Neural Information Processing Systems, 37:58757–58791, 2024

Eloi Alonso, Adam Jelley, Vincent Micheli, Anssi Kanervisto, Amos Storkey, Tim Pearce, and François Fleuret. Diffusion for world modeling: Visual details matter in atari.Advances in Neural Information Processing Systems, 37:58757–58791, 2024

2024

[18] [18]

Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets

Andreas Blattmann, Tim Dockhorn, Sumith Kulal, Daniel Mendelevitch, Maciej Kilian, Do- minik Lorenz, Yam Levi, Zion English, Vikram V oleti, Adam Letts, et al. Stable video diffusion: Scaling latent video diffusion models to large datasets.arXiv preprint arXiv:2311.15127, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[19] [19]

Decoupled Weight Decay Regularization

Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization.arXiv preprint arXiv:1711.05101, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017

[20] [20]

Droid-slam: Deep visual slam for monocular, stereo, and rgb-d cameras.Advances in neural information processing systems, 34:16558–16569, 2021

Zachary Teed and Jia Deng. Droid-slam: Deep visual slam for monocular, stereo, and rgb-d cameras.Advances in neural information processing systems, 34:16558–16569, 2021

2021

[21] [21]

The unrea- sonable effectiveness of deep features as a perceptual metric

Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang. The unrea- sonable effectiveness of deep features as a perceptual metric. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 586–595, 2018

2018

[22] [22]

Dreamsim: Learning new dimensions of human visual similarity using synthetic data.Advances in Neural Information Processing Systems, 36:50742–50768, 2023

Stephanie Fu, Netanel Tamir, Shobhita Sundaram, Lucy Chai, Richard Zhang, Tali Dekel, and Phillip Isola. Dreamsim: Learning new dimensions of human visual similarity using synthetic data.Advances in Neural Information Processing Systems, 36:50742–50768, 2023

2023

[23] [23]

Nomad: Goal masked diffusion policies for navigation and exploration

Ajay Sridhar, Dhruv Shah, Catherine Glossop, and Sergey Levine. Nomad: Goal masked diffusion policies for navigation and exploration. In2024 IEEE International Conference on Robotics and Automation (ICRA), pages 63–70. IEEE, 2024

2024

[24] [24]

Goal-conditioned reinforcement learning for data-driven maritime navigation

Vaishnav Vaidheeswaran, Dilith Jayakody, Samruddhi Mulay, Anand Lo, Md Mahbub Alam, and Gabriel Spadon. Goal-conditioned reinforcement learning for data-driven maritime navigation. arXiv preprint arXiv:2509.01838, 2025

work page arXiv 2025

[25] [25]

An efficient and multi-modal navigation system with one-step world model.arXiv preprint arXiv:2601.12277, 2026

Wangtian Shen, Ziyang Meng, Jinming Ma, Mingliang Zhou, and Diyun Xiang. An efficient and multi-modal navigation system with one-step world model.arXiv preprint arXiv:2601.12277, 2026

work page arXiv 2026

[26] [26]

Vision-and-language navi- gation: A survey of tasks, methods, and future directions

Jing Gu, Eliana Stefani, Qi Wu, Jesse Thomason, and Xin Wang. Vision-and-language navi- gation: A survey of tasks, methods, and future directions. InProceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 7606–7623, 2022

2022

[27] [27]

Navgpt: Explicit reasoning in vision-and-language navigation with large language models

Gengze Zhou, Yicong Hong, and Qi Wu. Navgpt: Explicit reasoning in vision-and-language navigation with large language models. InProceedings of the AAAI Conference on Artificial Intelligence, volume 38, pages 7641–7649, 2024

2024

[28] [28]

Profocus: Proactive perception and focused reasoning in vision-and-language navigation.arXiv preprint arXiv:2603.05530, 2026

Wei Xue, Mingcheng Li, Xuecheng Wu, Jingqun Tang, Dingkang Yang, and Lihua Zhang. Profocus: Proactive perception and focused reasoning in vision-and-language navigation.arXiv preprint arXiv:2603.05530, 2026

work page arXiv 2026

[29] [29]

Effonav: An effective foundation- model-based visual navigation approach in challenging environment.IEEE Robotics and Automation Letters, 2025

Wangtian Shen, Pengfei Gu, Haijian Qin, and Ziyang Meng. Effonav: An effective foundation- model-based visual navigation approach in challenging environment.IEEE Robotics and Automation Letters, 2025

2025

[30] [30]

Foundation-model-based action selection for behavior trees in navigation

Michele Moriconi, Stefan Laible, and Carmine Recchiuto. Foundation-model-based action selection for behavior trees in navigation. In2025 European Conference on Mobile Robots (ECMR), pages 1–7. IEEE, 2025

2025

[31] [31]

Foundation model driven robotics: A compre- hensive review.arXiv preprint arXiv:2507.10087, 2025

Muhammad Tayyab Khan and Ammar Waheed. Foundation model driven robotics: A compre- hensive review.arXiv preprint arXiv:2507.10087, 2025

work page arXiv 2025

[32] [32]

Gnm: A general navigation model to drive any robot

Dhruv Shah, Ajay Sridhar, Arjun Bhorkar, Noriaki Hirose, and Sergey Levine. Gnm: A general navigation model to drive any robot. In2023 IEEE International Conference on Robotics and Automation (ICRA), pages 7226–7233. IEEE, 2023. 11

2023

[33] [33]

Vint: A foundation model for visual navigation.arXiv preprint arXiv:2306.14846, 2023

Dhruv Shah, Ajay Sridhar, Nitish Dashora, Kyle Stachowicz, Kevin Black, Noriaki Hi- rose, and Sergey Levine. Vint: A foundation model for visual navigation.arXiv preprint arXiv:2306.14846, 2023

work page arXiv 2023

[34] [34]

Towards long-horizon vision-language navigation: Platform, benchmark and method

Xinshuai Song, Weixing Chen, Yang Liu, Vincent Chan, Guanbin Li, and Liang Lin. Towards long-horizon vision-language navigation: Platform, benchmark and method. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025

2025

[35] [35]

Embodied navigation foundation model.arXiv preprint arXiv:2509.12129, 2025

Jiazhao Zhang, Anqi Li, Yunpeng Qi, Minghan Li, Jiahang Liu, Shaoan Wang, Haoran Liu, Gengze Zhou, Yuze Wu, Xingxing Li, et al. Embodied navigation foundation model.arXiv preprint arXiv:2509.12129, 2025

work page arXiv 2025

[36] [36]

Vln-r1: Vision-language navigation via reinforcement fine-tuning.arXiv preprint arXiv:2506.17221, 2025

Zhangyang Qi, Zhixiong Zhang, Yizhou Yu, Jiaqi Wang, and Hengshuang Zhao. Vln-r1: Vision-language navigation via reinforcement fine-tuning.arXiv preprint arXiv:2506.17221, 2025

work page arXiv 2025

[37] [37]

ImagineNav++: Prompting Vision-Language Models as Embodied Navigator through Scene Imagination

Teng Wang, Xinxin Zhao, Wenzhe Cai, and Changyin Sun. Imaginenav++: Prompting vision-language models as embodied navigator through scene imagination.arXiv preprint arXiv:2512.17435, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[38] [38]

Vld: Visual language goal distance for reinforce- ment learning navigation.arXiv preprint arXiv:2512.07976, 2025

Lazar Milikic, Manthan Patel, and Jonas Frey. Vld: Visual language goal distance for reinforce- ment learning navigation.arXiv preprint arXiv:2512.07976, 2025

work page arXiv 2025

[39] [39]

Can Vision Foundation Models Navigate? Zero-Shot Real-World Evaluation and Lessons Learned

Maeva Guerrier, Karthik Soma, Jana Pavlasek, and Giovanni Beltrame. Can vision founda- tion models navigate? zero-shot real-world evaluation and lessons learned.arXiv preprint arXiv:2603.25937, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026

[40] [40]

Navforesee: A unified vision-language world model for hierarchi- cal planning and dual-horizon navigation prediction

Fei Liu, Shichao Xie, Minghua Luo, Zedong Chu, Junjun Hu, Xiaolong Wu, and Mu Xu. Navforesee: A unified vision-language world model for hierarchical planning and dual-horizon navigation prediction.arXiv preprint arXiv:2512.01550, 2025

work page arXiv 2025

[41] [41]

Dream to Control: Learning Behaviors by Latent Imagination

Danijar Hafner, Timothy Lillicrap, Jimmy Ba, and Mohammad Norouzi. Dream to control: Learning behaviors by latent imagination.arXiv preprint arXiv:1912.01603, 2019

work page internal anchor Pith review Pith/arXiv arXiv 1912

[42] [42]

Mastering Diverse Domains through World Models

Danijar Hafner, Jurgis Pasukonis, Jimmy Ba, and Timothy Lillicrap. Mastering diverse domains through world models.arXiv preprint arXiv:2301.04104, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[43] [43]

Dreaming: Model-based reinforcement learning by latent imagination without reconstruction

Masashi Okada and Tadahiro Taniguchi. Dreaming: Model-based reinforcement learning by latent imagination without reconstruction. In2021 ieee international conference on robotics and automation (icra), pages 4209–4215. IEEE, 2021

2021

[44] [44]

Model-based reinforcement learning via imagination with derived memory.Advances in Neural Information Processing Systems, 34:9493–9505, 2021

Yao Mu, Yuzheng Zhuang, Bin Wang, Guangxiang Zhu, Wulong Liu, Jianyu Chen, Ping Luo, Shengbo Li, Chongjie Zhang, and Jianye Hao. Model-based reinforcement learning via imagination with derived memory.Advances in Neural Information Processing Systems, 34:9493–9505, 2021

2021

[45] [45]

Dreamerad: Efficient reinforcement learning via latent world model for autonomous driving.arXiv preprint arXiv:2603.24587, 2026

Pengxuan Yang, Yupeng Zheng, Deheng Qian, Zebin Xing, Qichao Zhang, Linbo Wang, Yichen Zhang, Shaoyu Guo, Zhongpu Xia, Qiang Chen, et al. Dreamerad: Efficient reinforcement learning via latent world model for autonomous driving.arXiv preprint arXiv:2603.24587, 2026

work page arXiv 2026

[46] [46]

4d latent world model for robot planning.OpenReview preprint, 2026

Zhiyi Li, Peilin Wu, Xiaoshen Han, Ruojin Cai, and Yilun Du. 4d latent world model for robot planning.OpenReview preprint, 2026

2026

[47] [47]

Denoising diffusion probabilistic models.Advances in neural information processing systems, 33:6840–6851, 2020

Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models.Advances in neural information processing systems, 33:6840–6851, 2020

2020

[48] [48]

High- resolution image synthesis with latent diffusion models

Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High- resolution image synthesis with latent diffusion models. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10684–10695, 2022

2022

[49] [49]

Video diffusion models.Advances in neural information processing systems, 35:8633–8646, 2022

Jonathan Ho, Tim Salimans, Alexey Gritsenko, William Chan, Mohammad Norouzi, and David J Fleet. Video diffusion models.Advances in neural information processing systems, 35:8633–8646, 2022. 12

2022

[50] [50]

Imagen Video: High Definition Video Generation with Diffusion Models

Jonathan Ho, William Chan, Chitwan Saharia, Jay Whang, Ruiqi Gao, Alexey Gritsenko, Diederik P Kingma, Ben Poole, Mohammad Norouzi, David J Fleet, et al. Imagen video: High definition video generation with diffusion models.arXiv preprint arXiv:2210.02303, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022

[51] [51]

Align your latents: High-resolution video synthesis with latent diffusion models

Andreas Blattmann, Robin Rombach, Huan Ling, Tim Dockhorn, Seung Wook Kim, Sanja Fidler, and Karsten Kreis. Align your latents: High-resolution video synthesis with latent diffusion models. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 22563–22575, 2023

2023

[52] [52]

AnimateDiff: Animate Your Personalized Text-to-Image Diffusion Models without Specific Tuning

Yuwei Guo, Ceyuan Yang, Anyi Rao, Zhengyang Liang, Yaohui Wang, Yu Qiao, Maneesh Agrawala, Dahua Lin, and Bo Dai. Animatediff: Animate your personalized text-to-image diffusion models without specific tuning.arXiv preprint arXiv:2307.04725, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[53] [53]

Open-Sora: Democratizing Efficient Video Production for All

Zangwei Zheng, Xiangyu Peng, Tianji Yang, Chenhui Shen, Shenggui Li, Hongxin Liu, Yukun Zhou, Tianyi Li, and Yang You. Open-sora: Democratizing efficient video production for all. arXiv preprint arXiv:2412.20404, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[54] [54]

Vid2world: Crafting video diffusion models to interactive world models,

Siqiao Huang, Jialong Wu, Qixing Zhou, Shangchen Miao, and Mingsheng Long. Vid2world: Crafting video diffusion models to interactive world models.arXiv preprint arXiv:2505.14357, 2025

work page arXiv 2025

[55] [55]

World Action Models are Zero-shot Policies

Seonghyeon Ye, Yunhao Ge, Kaiyuan Zheng, Shenyuan Gao, Sihyun Yu, George Kurian, Suneel Indupuru, You Liang Tan, Chuning Zhu, Jiannan Xiang, et al. World action models are zero-shot policies.arXiv preprint arXiv:2602.15922, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026

[56] [56]

Aether: Geometric-aware unified world modeling

Haoyi Zhu, Yifan Wang, Jianjun Zhou, Wenzheng Chang, Yang Zhou, Zizun Li, Junyi Chen, Chunhua Shen, Jiangmiao Pang, and Tong He. Aether: Geometric-aware unified world modeling. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 8535– 8546, 2025

2025

[57] [57]

Language-conditioned world modeling for visual navigation.arXiv preprint arXiv:2603.26741, 2026

Yifei Dong, Fengyi Wu, Yilong Dai, Lingdong Kong, Guangyu Chen, Xu Zhu, Qiyu Hu, Tianyu Wang, Johnalbert Garnica, Feng Liu, et al. Language-conditioned world modeling for visual navigation.arXiv preprint arXiv:2603.26741, 2026

work page arXiv 2026

[58] [58]

Learning Interactive Real-World Simulators

Mengjiao Yang, Yilun Du, Kamyar Ghasemipour, Jonathan Tompson, Dale Schuurmans, and Pieter Abbeel. Learning interactive real-world simulators.arXiv preprint arXiv:2310.06114, 1(2):6, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[59] [59]

Unified world models: Memory-augmented planning and foresight for visual navigation.arXiv preprint arXiv:2510.08713, 2025

Yifei Dong, Fengyi Wu, Guangyu Chen, Zhi-Qi Cheng, Qiyu Hu, Yuxuan Zhou, Jingdong Sun, Jun-Yan He, Qi Dai, and Alexander G Hauptmann. Unified world models: Memory-augmented planning and foresight for visual navigation.arXiv preprint arXiv:2510.08713, 2025

work page arXiv 2025

[60] [60]

Mowm: Mixture-of-world-models for embodied planning via latent-to-pixel feature modulation.arXiv preprint arXiv:2509.21797, 2025

Yangcheng Yu, Xin Jin, Yu Shang, Xin Zhang, Haisheng Su, Wei Wu, and Yong Li. Mowm: Mixture-of-world-models for embodied planning via latent-to-pixel feature modulation.arXiv preprint arXiv:2509.21797, 2025

work page arXiv 2025

[61] [61]

Gigaworld-policy: An efficient action-centered world–action model.arXiv preprint arXiv:2603.17240, 2026

Angen Ye, Boyuan Wang, Chaojun Ni, Guan Huang, Guosheng Zhao, Hao Li, Hengtao Li, Jie Li, Jindi Lv, Jingyu Liu, et al. Gigaworld-policy: An efficient action-centered world–action model.arXiv preprint arXiv:2603.17240, 2026

work page arXiv 2026

[62] [62]

Towards high-consistency embodied world model with multi-view trajectory videos.arXiv preprint arXiv:2511.12882, 2025

Taiyi Su, Jian Zhu, Yaxuan Li, Chong Ma, Jianjun Zhang, Zitai Huang, Hanli Wang, and Yi Xu. Towards high-consistency embodied world model with multi-view trajectory videos.arXiv preprint arXiv:2511.12882, 2025

work page arXiv 2025

[63] [63]

Worldsimbench: Towards video generation models as world simulators.arXiv preprint arXiv:2410.18072, 2024

Yiran Qin, Zhelun Shi, Jiwen Yu, Xijun Wang, Enshen Zhou, Lijun Li, Zhenfei Yin, Xihui Liu, Lu Sheng, Jing Shao, et al. Worldsimbench: Towards video generation models as world simulators.arXiv preprint arXiv:2410.18072, 2024

work page arXiv 2024

[64] [64]

Wow, wo, val! a comprehensive embodied world model evaluation turing test.arXiv preprint arXiv:2601.04137, 2026

Chun-Kai Fan, Xiaowei Chi, Xiaozhu Ju, Hao Li, Yong Bao, Yu-Kai Wang, Lizhang Chen, Zhiyuan Jiang, Kuangzhi Ge, Ying Li, et al. Wow, wo, val! a comprehensive embodied world model evaluation turing test.arXiv preprint arXiv:2601.04137, 2026. 13

work page arXiv 2026

[65] [65]

Worldarena: A unified benchmark for evaluating perception and functional utility of embodied world models.arXiv preprint arXiv:2602.08971, 2026

Yu Shang, Zhuohang Li, Yiding Ma, Weikang Su, Xin Jin, Ziyou Wang, Lei Jin, Xin Zhang, Yinzhou Tang, Haisheng Su, et al. Worldarena: A unified benchmark for evaluating perception and functional utility of embodied world models.arXiv preprint arXiv:2602.08971, 2026

work page arXiv 2026

[66] [66]

WorldLens: Full-Spectrum Evaluations of Driving World Models in Real World

Ao Liang, Lingdong Kong, Tianyi Yan, Hongsi Liu, Wesley Yang, Ziqi Huang, Wei Yin, Jialong Zuo, Yixuan Hu, Dekai Zhu, et al. Worldlens: Full-spectrum evaluations of driving world models in real world.arXiv preprint arXiv:2512.10958, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[67] [67]

Mind: Benchmarking memory consistency and action control in world models.arXiv preprint arXiv:2602.08025, 2026

Yixuan Ye, Xuanyu Lu, Yuxin Jiang, Yuchao Gu, Rui Zhao, Qiwei Liang, Jiachun Pan, Fengda Zhang, Weijia Wu, and Alex Jinpeng Wang. Mind: Benchmarking memory consistency and action control in world models.arXiv preprint arXiv:2602.08025, 2026

work page arXiv 2026

[68] [68]

LoopNav: Benchmarking Spatial Consistency in World Models

Kewei Lian, Shaofei Cai, Yilun Du, and Yitao Liang. Toward memory-aided world models: Benchmarking via spatial consistency.arXiv preprint arXiv:2505.22976, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[69] [69]

Toward Consistent World Models with Multi-Token Prediction and Latent Semantic Enhancement

Qimin Zhong, Hao Liao, Haiming Qin, Mingyang Zhou, Rui Mao, Wei Chen, and Naipeng Chao. Toward consistent world models with multi-token prediction and latent semantic enhancement. arXiv preprint arXiv:2604.06155, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026

[70] [70]

Wildworld: A large-scale dataset for dynamic world modeling with actions and explicit state toward generative arpg.arXiv preprint arXiv:2603.23497, 2026

Zhen Li, Zian Meng, Shuwei Shi, Wenshuo Peng, Yuwei Wu, Bo Zheng, Chuanhao Li, and Kaipeng Zhang. Wildworld: A large-scale dataset for dynamic world modeling with actions and explicit state toward generative arpg.arXiv preprint arXiv:2603.23497, 2026

work page arXiv 2026

[71] [71]

Mwm: Mobile world models for action-conditioned consistent prediction.arXiv preprint arXiv:2603.07799, 2026

Han Yan, Zishang Xiang, Zeyu Zhang, and Hao Tang. Mwm: Mobile world models for action-conditioned consistent prediction.arXiv preprint arXiv:2603.07799, 2026

work page arXiv 2026

[72] [72]

Generative modeling of molecular dynamics trajectories.Advances in Neural Information Processing Systems, 37:40534– 40564, 2024

Bowen Jing, Hannes Stärk, Tommi Jaakkola, and Bonnie Berger. Generative modeling of molecular dynamics trajectories.Advances in Neural Information Processing Systems, 37:40534– 40564, 2024

2024

[73] [73]

Scalable emulation of protein equilibrium ensembles with generative deep learning.Science, 389(6761):eadv9817, 2025

Sarah Lewis, Tim Hempel, José Jiménez-Luna, Michael Gastegger, Yu Xie, Andrew YK Foong, Victor García Satorras, Osama Abdin, Bastiaan S Veeling, Iryna Zaporozhets, et al. Scalable emulation of protein equilibrium ensembles with generative deep learning.Science, 389(6761):eadv9817, 2025

2025

[74] [74]

Beyond ensembles: Simulating all-atom protein dynamics in a learned latent space.arXiv preprint arXiv:2509.02196, 2025

Aditya Sengar, Jiying Zhang, Pierre Vandergheynst, and Patrick Barth. Beyond ensembles: Simulating all-atom protein dynamics in a learned latent space.arXiv preprint arXiv:2509.02196, 2025

work page arXiv 2025

[75] [75]

Conditional diffusion with locality-aware modal alignment for generating diverse protein conformational ensembles.Nature Machine Intelligence, pages 1–20, 2026

Baoli Wang, Chenglin Wang, Jingyang Chen, Danlin Liu, Changzhi Sun, Jie Zhang, Kai Zhang, and Honglin Li. Conditional diffusion with locality-aware modal alignment for generating diverse protein conformational ensembles.Nature Machine Intelligence, pages 1–20, 2026

2026

[76] [76]

Pathdiffusion: modeling protein folding pathway using evolution-guided diffusion.bioRxiv, pages 2026–01, 2026

Kailong Zhao, Chenxiao Xiang, Bin Cheng, Yunyun Shen, Wenkai Wang, Shuyun Chen, Baoquan Su, Guijun Zhang, Zhenling Peng, and Jianyi Yang. Pathdiffusion: modeling protein folding pathway using evolution-guided diffusion.bioRxiv, pages 2026–01, 2026

2026

[77] [77]

Tensor field networks: Rotation- and translation-equivariant neural networks for 3D point clouds

Nathaniel Thomas, Tess Smidt, Steven Kearnes, Lusann Yang, Li Li, Kai Kohlhoff, and Patrick Riley. Tensor field networks: Rotation-and translation-equivariant neural networks for 3d point clouds.arXiv preprint arXiv:1802.08219, 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018

[78] [78]

E (n) equivariant graph neural networks

Vıctor Garcia Satorras, Emiel Hoogeboom, and Max Welling. E (n) equivariant graph neural networks. InInternational conference on machine learning, pages 9323–9332. PMLR, 2021

2021

[79] [79]

Equivariant diffusion for molecule generation in 3d

Emiel Hoogeboom, Vıctor Garcia Satorras, Clément Vignac, and Max Welling. Equivariant diffusion for molecule generation in 3d. InInternational conference on machine learning, pages 8867–8887. PMLR, 2022

2022

[80] [80]

Geodiff: A geo- metric diffusion model for molecular conformation generation.arXiv preprint arXiv:2203.02923, 2022

Minkai Xu, Lantao Yu, Yang Song, Chence Shi, Stefano Ermon, and Jian Tang. Geodiff: A geo- metric diffusion model for molecular conformation generation.arXiv preprint arXiv:2203.02923, 2022. 14

work page arXiv 2022