
arxiv: 2606.27364 · v1 · pith:UXNED3O4new · submitted 2026-06-25 · 💻 cs.CV

PhysiFormer: Learning to Simulate Mechanics in World Space

Pith reviewed 2026-06-26 05:08 UTC · model grok-4.3

classification 💻 cs.CV
keywords: diffusion models · 3D mesh · physical simulation · trajectory prediction · world coordinates · transformer · rigid elastic motion · multi-object dynamics

The pith

Casting vertex trajectory prediction as a single denoising diffusion process directly in world coordinates produces physically plausible 3D mesh motion without explicit inductive biases for rigidity or causality.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

PhysiFormer represents objects as 3D meshes in world coordinates and predicts future vertex trajectories from initial positions, velocities, and material type using a diffusion transformer. It casts the task as one denoising diffusion process rather than building ad-hoc latent spaces or enforcing physical rules. The probabilistic model captures uncertainty and generates diverse futures while generalizing from over 100k simulated trajectories to mixed materials, unseen real-world geometries, and larger object counts. It outperforms autoregressive baselines on trajectory accuracy, rigidity preservation, and momentum consistency. If the approach holds, coordinate-space diffusion offers a route to view-invariant physical world models for robotics and graphics.

Core claim

The central claim is that excellent results on physically-plausible 3D object motion can be obtained without ad-hoc latent spaces or explicit enforcement of rigidity and causality by representing objects as 3D meshes in world coordinates and casting vertex trajectory prediction as a single denoising diffusion process directly in those coordinates.
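
The idea of denoising directly on world-space vertex coordinates can be illustrated with a minimal sketch of one training pair. The Figure 2 caption mentions a flow-matching schedule; the linear interpolation, the time convention (t = 0 clean, t = 1 pure noise), the function name `flow_matching_pair`, and the tensor shapes below are all assumptions made for illustration, not the paper's implementation.

```python
import numpy as np

def flow_matching_pair(x_clean, t, rng):
    """Build one (noised input, regression target) pair.

    x_clean: array of shape (T, N, 3), vertex trajectories in world coordinates.
    t:       scalar in [0, 1]; assumed convention: 0 = clean data, 1 = pure noise.
    """
    noise = rng.standard_normal(x_clean.shape)
    x_t = (1.0 - t) * x_clean + t * noise   # point on the straight path data -> noise
    v_target = noise - x_clean              # constant velocity along that path
    return x_t, v_target

# Illustrative shapes only: 48 frames, 100 vertices.
rng = np.random.default_rng(0)
trajectory = rng.uniform(-1.0, 1.0, size=(48, 100, 3))
x_t, v_target = flow_matching_pair(trajectory, t=0.3, rng=rng)
```

Under this convention a network would be trained to regress `v_target` from `x_t`, `t`, and the conditioning (first-frame positions, velocities, material). Because the path is linear, the clean trajectory is recoverable as `x_t - t * v_target`.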

What carries the argument

A diffusion transformer whose attention is factorised over time, space, and objects, performing denoising directly on vertex positions and velocities in world coordinates.
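
A rough sketch of what attention factorised over time, space, and objects could look like follows. It assumes a token tensor of shape (T, O, N, D) for frames, objects, vertices per object, and channels, and applies single-head self-attention along one axis at a time. The axis order, the identity query/key/value projections, and the names `attend_along` and `factorised_block` are hypothetical; the paper's learned projections, register tokens, conditioning, and any positional encodings are omitted.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)   # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attend_along(x, axis):
    """Single-head self-attention along one axis of x with shape (T, O, N, D).

    Identity Q/K/V projections are used for brevity (an assumption).
    """
    x_m = np.moveaxis(x, axis, -2)                               # (..., L, D)
    scores = x_m @ np.swapaxes(x_m, -1, -2) / np.sqrt(x.shape[-1])
    out = softmax(scores, axis=-1) @ x_m                         # (..., L, D)
    return np.moveaxis(out, -2, axis)

def factorised_block(x):
    """Residual attention over time (axis 0), vertices (axis 2), then objects (axis 1)."""
    for axis in (0, 2, 1):
        x = x + attend_along(x, axis)
    return x

rng = np.random.default_rng(0)
tokens = rng.standard_normal((4, 3, 5, 8))   # T=4, O=3, N=5, D=8
out = factorised_block(tokens)
```

Each pass costs attention over one axis length rather than over all T·O·N tokens at once, which is the efficiency argument for factorisation. Since no object index is encoded, reordering the objects in the input simply reorders the output in the same way.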

If this is right

  • The probabilistic formulation enables sampling of multiple diverse yet plausible futures from identical initial conditions.
  • Factorised attention supports permutation-invariant reasoning over multiple objects without explicit object encodings.
  • The model generalises to mixed rigid-elastic interactions and to object counts and real-world geometries not seen during training.
  • Training on simulated data yields substantially higher trajectory accuracy, rigidity preservation, and physical consistency than autoregressive baselines.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The world-coordinate formulation could integrate directly with 3D reconstruction systems to enable physical prediction from partial observations.
  • Fine-tuning the trained model on limited real motion-capture data might close the remaining sim-to-real gap for robotic manipulation tasks.
  • Extending the vertex representation to include additional surface properties could allow the same diffusion process to handle contact with deformable environments.

Load-bearing premise

Trajectories generated by the underlying simulator constitute a sufficient and unbiased training distribution that supports generalization to real-world geometries and mixed-material interactions.

What would settle it

A controlled test in which the model produces trajectories that violate momentum conservation or lose rigidity when applied to a real scanned object with material properties and geometry outside the training distribution would falsify the central claim.
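
A test of this kind needs concrete checks. The sketch below shows two simple ones for a single object's predicted trajectory of shape (T, N, 3): drift in pairwise vertex distances as a rigidity measure, and drift in total linear momentum from finite-difference velocities. These are illustrative definitions, not the paper's metrics; the unit vertex masses, the time step `dt`, and the function names are assumptions, and the momentum check is only meaningful for segments without external forces such as gravity or ground contact.

```python
import numpy as np

def rigidity_error(traj):
    """Largest change in any pairwise vertex distance relative to frame 0."""
    diff = traj[:, :, None, :] - traj[:, None, :, :]    # (T, N, N, 3)
    dist = np.linalg.norm(diff, axis=-1)                # (T, N, N)
    return np.abs(dist - dist[0]).max()

def momentum_drift(traj, dt=1.0, masses=None):
    """Largest deviation of total linear momentum from its first value."""
    n_vertices = traj.shape[1]
    m = np.ones(n_vertices) if masses is None else np.asarray(masses)
    vel = np.diff(traj, axis=0) / dt                    # (T-1, N, 3) finite differences
    p = (m[None, :, None] * vel).sum(axis=1)            # (T-1, 3) total momentum
    return np.linalg.norm(p - p[0], axis=-1).max()

# Synthetic rigid body: spins about its centroid while translating at constant velocity.
rng = np.random.default_rng(0)
points = rng.standard_normal((20, 3))
points -= points.mean(axis=0)
frames = []
for k in range(10):
    a = 0.1 * k
    rot = np.array([[np.cos(a), -np.sin(a), 0.0],
                    [np.sin(a),  np.cos(a), 0.0],
                    [0.0,        0.0,       1.0]])
    frames.append(points @ rot.T + k * np.array([0.5, 0.0, 0.0]))
rigid_traj = np.stack(frames)
```

For the synthetic rigid trajectory both quantities are zero up to floating-point error; a prediction that stretches the object or changes its net momentum without contact would show up as a clearly nonzero value.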

Figures

Figures reproduced from arXiv: 2606.27364 by Andrea Vedaldi, Yiming Chen, Yushi Lan.

Figure 1: PHYSIFORMER overview. Given initial per-vertex positions X_0 ∈ R^{N×3} and velocities V_0 ∈ R^{N×3}, and material conditions of (1) rigid, (2) deformable, or (3) mixed, PHYSIFORMER predicts full-sequence future vertex trajectories in a single forward pass, producing physically plausible multi-object dynamics, with mesh topology imposed at inference time. Output can be rendered as 4D mesh motion under arbitrary …
Figure 2: PHYSIFORMER Architecture. During training, input mesh vertex coordinates in R^{T×N×3} are projected into hidden dimension D = 1024 via a linear embedder x_embed, and diffused with noise according to the flow-matching schedule. Each noised vertex token is additively conditioned on first-frame position and velocity embeddings (via separate x_embed_cond and v_embed) and a material embedding. We use 16 prepende…
Figure 3: Qualitative comparison of PHYSIFORMER against autoregressive baselines on trained 10k rigid object data. At t = 10, rigidity is not preserved in ΦAR_ctx1, but objects remain rigid across all other models. As t increases, all autoregressive baselines diverge due to error accumulation: stationary objects fail to remain at rest, objects escape the implicit bounding box, and object shapes deform severely, eve…
Figure 4: PHYSIFORMER generalizes to complex real-world object geometries and object counts not seen during training. Top: Inference on 2 deformable objects (fish and teapot) plus 1 rigid bunny, each with 100 vertices per object. Deformation is most visible for the middle-frame purple teapot. PHYSIFORMER allows mixed-material inference although training only saw uniform material across all objects per scene. Bottom:…
Figure 5: PHYSIFORMER-L-10k generalizes to object geometries and counts not seen during training, shown at t = 0, 15, 30, 48. The first row shows the best AR model (T IEr=1.0) on two unseen convex objects. For the following rows, we have top: two unseen convex objects, middle: seven objects from seen convex templates, exceeding the training maximum of five, bottom: three objects with unseen concave geometry. PHYSIFO…
Figure 6: Mesh templates and real-world geometries used for dataset generation and out-of…
Figure 7: Physics simulator failure cases occur when boundary contacts are imperfectly resolved, …
Figure 8: Examples of object overlap during inference
Original abstract

We present PhysiFormer, a diffusion transformer for physically-plausible 3D object motion. Unlike video world models that operate in view-dependent pixel space, PhysiFormer represents objects as 3D meshes expressed in world coordinates. Given the initial vertex positions and velocities, as well as object material type, rigid or elastic, the model samples future vertex trajectories. While related neural physics approaches build on ad-hoc latent spaces or explicitly enforce rigidity and causality, PhysiFormer shows that excellent results can be obtained without any such inductive biases, by casting vertex trajectory prediction as a single denoising diffusion process directly in world coordinates. The probabilistic formulation captures uncertainty in the learned dynamics, enabling diverse plausible futures from initial conditions, making this framework potentially useful for applications with unobserved uncertainty. The model features attention factorised over time, space, and objects for efficiency, enabling permutation-invariant multi-object reasoning without needing explicit object encoding. Trained on over 100k simulated trajectories, PhysiFormer generates rigid and elastic mechanics, and generalises to mixed-material settings, unseen real-world geometries, and larger object counts. It substantially outperforms autoregressive baselines in trajectory accuracy, rigidity preservation, and momentum-based physical consistency. Our results position coordinate-space diffusion as a promising step toward view-invariant, geometry-aware world modelling for robotics, graphics, and physical design. Visualisations, code, and models are available at https://yimingc9.github.io/physiformer.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper presents PhysiFormer, a diffusion transformer that predicts 3D mesh vertex trajectories directly in world coordinates via a single denoising diffusion process. Given initial positions, velocities, and material type (rigid or elastic), the model generates future trajectories without explicit rigidity, causality, or latent-space inductive biases. Trained on over 100k simulated trajectories, it claims to produce physically consistent rigid and elastic motion, generalize to mixed-material interactions, unseen real-world geometries, and larger object counts, and substantially outperform autoregressive baselines on trajectory accuracy, rigidity preservation, and momentum consistency. The probabilistic formulation allows sampling diverse plausible futures, and the architecture uses factorized attention over time, space, and objects for efficiency and permutation invariance.

Significance. If the central claims hold, the work provides evidence that coordinate-space diffusion can achieve strong physical plausibility and generalization without hand-engineered biases or latent encodings, potentially simplifying geometry-aware world models for robotics, graphics, and design. The public release of code, models, and visualizations is a clear strength for reproducibility.

major comments (2)
  1. [Abstract] Abstract: the claim of 'substantially outperforms autoregressive baselines in trajectory accuracy, rigidity preservation, and momentum-based physical consistency' is presented without any quantitative metrics, baseline definitions, error bars, or table references, leaving the magnitude and statistical reliability of the reported gains unassessable from the provided summary.
  2. [Abstract] Abstract and training description: the claims of generalization to 'mixed-material settings, unseen real-world geometries, and larger object counts' rest on training exclusively on over 100k simulated trajectories in which each scene uses one uniform material (all rigid or all elastic) and contains at most five objects; no details are given on how the simulator's geometry and material sampling distribution matches real-world variation, so out-of-distribution performance could reflect memorization rather than the diffusion formulation.
minor comments (1)
  1. [Abstract] The abstract states that 'visualisations, code, and models are available' at a URL; the manuscript should include a brief statement on the exact license and reproducibility package contents.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and will revise the manuscript to improve the clarity and specificity of the abstract claims.

  1. Referee: [Abstract] Abstract: the claim of 'substantially outperforms autoregressive baselines in trajectory accuracy, rigidity preservation, and momentum-based physical consistency' is presented without any quantitative metrics, baseline definitions, error bars, or table references, leaving the magnitude and statistical reliability of the reported gains unassessable from the provided summary.

    Authors: We agree that the abstract would be strengthened by including quantitative references. In the revised version we will update the abstract to cite specific metrics (e.g., relative error reductions) and point to the tables and sections that define the autoregressive baselines, report error bars, and present statistical comparisons. revision: yes

  2. Referee: [Abstract] Abstract and training description: the claims of generalization to 'mixed-material settings, unseen real-world geometries, and larger object counts' rest on training exclusively on over 100k simulated trajectories in which each scene uses one uniform material (all rigid or all elastic) and contains at most five objects; no details are given on how the simulator's geometry and material sampling distribution matches real-world variation, so out-of-distribution performance could reflect memorization rather than the diffusion formulation.

    Authors: The experiments in Sections 4.3–4.4 evaluate zero-shot generalization on mixed-material interactions, unseen real-world geometries, and larger object counts. We acknowledge that the current manuscript provides limited explicit discussion of the simulator sampling distribution. We will add a dedicated paragraph in the dataset section detailing the geometry and material parameter ranges and will include a brief analysis relating these ranges to real-world variation to better support the generalization claims. revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation is self-contained


The paper trains PhysiFormer as a standard denoising diffusion process on externally generated simulation data (over 100k trajectories from rigid/elastic simulators) and evaluates using independent physical-consistency metrics such as trajectory accuracy, rigidity preservation, and momentum conservation. No load-bearing step reduces by construction to a fitted parameter, self-defined quantity, or self-citation chain; the central modeling choice (world-coordinate diffusion without explicit rigidity/causality biases) is an architectural decision whose performance is measured against external baselines and held-out data rather than being tautological with its inputs. Generalization claims rest on empirical results rather than definitional equivalence.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 0 invented entities

The central claim rests on the assumption that the 100k simulated trajectories are representative and that the diffusion objective alone is sufficient to recover dynamics; no new physical entities are postulated.

free parameters (1)
  • neural network weights
    All model parameters are fitted to the 100k simulated trajectories.
axioms (1)
  • domain assumption: Simulated trajectories accurately capture rigid and elastic mechanics for the objects used in training.
    The model is trained exclusively on simulator output and evaluated on generalisation to real geometries.


