RigidFormer: Learning Rigid Dynamics using Transformers
Recognition: 3 theorem links
Pith reviewed 2026-05-12 02:14 UTC · model grok-4.3
The pith
RigidFormer simulates multi-object rigid-body dynamics from point clouds by advancing objects via compact anchors in a Transformer and projecting updates onto the rigid manifold.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
RigidFormer reasons at the object level by advancing each object through compact anchors; Anchor-Vertex Pooling enriches anchors with local geometry while avoiding dense vertex-level message passing; Anchor-based RoPE injects anchor geometry into attention in a permutation-equivariant way for objects and invariant way for anchors; and differentiable Kabsch projection enforces rigidity on the predicted updates. On standard benchmarks this yields performance that matches or exceeds mesh-based methods from point inputs, with faster runtime, better generalization to unseen point counts and other datasets, and scaling to more than 200 objects, plus a preliminary extension to articulated bodies.
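The rotary-embedding piece of that claim can be made concrete. Below is a minimal sketch of standard rotary position embedding (RoPE): consecutive feature pairs are rotated by position-dependent angles, so attention scores between rotated queries and keys depend only on relative position. The paper's Anchor-based RoPE would derive the position input from anchor geometry rather than a token index; that substitution, and the exact parameterization, are assumptions here, not the paper's implementation.

```python
import math

def rope(x, pos, base=10000.0):
    """Rotate consecutive feature pairs of x by angles that scale with pos.

    Standard RoPE recipe; `pos` is a scalar position. Anchor-based RoPE
    (per the paper) would instead feed in anchor-derived geometry -- that
    swap is hypothetical in this sketch.
    """
    out = []
    for i in range(0, len(x), 2):
        theta = pos / (base ** (i / len(x)))  # per-pair frequency
        c, s = math.cos(theta), math.sin(theta)
        a, b = x[i], x[i + 1]
        out += [a * c - b * s, a * s + b * c]
    return out

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

# Scores depend only on relative position: offsets (5, 7) and (0, 2)
# have the same difference, so the dot products agree.
q, k = [0.3, -1.2, 0.7, 0.5], [1.1, 0.4, -0.2, 0.9]
assert abs(dot(rope(q, 5.0), rope(k, 7.0)) - dot(rope(q, 0.0), rope(k, 2.0))) < 1e-9
```

The relative-position property is what makes the mechanism a natural carrier for geometry: only differences between positions, not absolute coordinates, reach the attention scores.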
What carries the argument
The central mechanism is object-centric attention over compact anchors enriched by Anchor-Vertex Pooling, equipped with Anchor-based RoPE for geometry-aware permutation-equivariant processing, followed by differentiable Kabsch alignment that projects updates onto the rigid-body manifold.
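The Kabsch projection step admits a compact illustration. The sketch below is a closed-form 2D analogue (the paper works in 3D, where the optimal rotation comes from an SVD of the cross-covariance; this planar pure-Python version is illustrative, not the paper's implementation): given reference anchors and predicted anchor positions, it recovers the rigid rotation and translation that best align them, which is the sense in which updates are "projected onto the rigid manifold".

```python
import math

def kabsch_2d(P, Q):
    """Best-fit rigid transform (theta, t) mapping points P onto Q.

    2D analogue of the Kabsch algorithm: center both point sets, then
    the optimal rotation angle is atan2 of the cross/dot correlation
    sums. The 3D case in the paper would use an SVD instead.
    """
    n = len(P)
    cp = (sum(x for x, _ in P) / n, sum(y for _, y in P) / n)
    cq = (sum(x for x, _ in Q) / n, sum(y for _, y in Q) / n)
    Pc = [(x - cp[0], y - cp[1]) for x, y in P]
    Qc = [(x - cq[0], y - cq[1]) for x, y in Q]
    s = sum(px * qy - py * qx for (px, py), (qx, qy) in zip(Pc, Qc))
    c = sum(px * qx + py * qy for (px, py), (qx, qy) in zip(Pc, Qc))
    theta = math.atan2(s, c)  # optimal rotation angle
    # Translation carries the rotated centroid of P onto the centroid of Q.
    t = (cq[0] - (cp[0] * math.cos(theta) - cp[1] * math.sin(theta)),
         cq[1] - (cp[0] * math.sin(theta) + cp[1] * math.cos(theta)))
    return theta, t

# Recover a known rotation (0.3 rad) and translation (2, 3): a free-form
# network update would not be rigid, but this projection always is.
P = [(1.0, 0.0), (0.0, 1.0), (-1.0, 0.0), (0.0, -1.0)]
Q = [(math.cos(0.3) * x - math.sin(0.3) * y + 2.0,
      math.sin(0.3) * x + math.cos(0.3) * y + 3.0) for x, y in P]
theta, t = kabsch_2d(P, Q)
assert abs(theta - 0.3) < 1e-9
```

Because every step of the alignment is smooth in the inputs, the projection is differentiable and can sit inside the training loss, which is why the referee's later point matters: it constrains the output state, not the learned force update itself.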
If this is right
- The model matches or beats mesh-based baselines on standard rigid-dynamics benchmarks while using only point inputs.
- Runtime is lower than mesh-based alternatives because computation stays at the anchor and object level rather than vertex level.
- Performance holds when tested on point clouds with resolutions never seen during training and when transferred across different datasets.
- The architecture scales to scenes containing more than 200 interacting objects.
- Treating articulated-body parts as separate objects yields a preliminary command-conditioned extension without changing the core design.
Where Pith is reading between the lines
- The same anchor-plus-projection pattern could let vision pipelines feed raw depth or LiDAR points straight into a physics simulator without an intermediate meshing step.
- Because attention operates at object granularity, the method might extend naturally to hybrid scenes that mix rigid and deformable objects by swapping only the projection step.
- If anchor count is treated as a hyperparameter, one could test whether increasing anchors per object recovers fine contact details that the current compact representation approximates.
Load-bearing premise
Compact anchors with local pooling and Kabsch projection are enough to capture discontinuous contact forces and limit error buildup over long horizons without dense vertex interactions or mesh topology.
What would settle it
A scene of many small objects in repeated stacking or sliding contact where short-term accuracy matches baselines but long-horizon rollouts diverge visibly from ground truth despite the claimed generalization.
Original abstract
Learning-based simulation of multi-object rigid-body dynamics remains difficult because contact is discontinuous and errors compound over long horizons. Most existing methods remain tied to mesh connectivity and vertex-level message passing, which limits their applicability to mesh-free inputs such as point clouds and leads to high computational cost. Efficiently modeling high-fidelity rigid-body dynamics from mesh-free representations, therefore, remains challenging. We introduce RigidFormer, an object-centric Transformer-based model that learns mesh-free rigid-body dynamics with controllable integration step sizes. RigidFormer reasons at the object level and advances each object through compact anchors; Anchor-Vertex Pooling enriches these anchors with local vertex features, retaining contact-relevant geometry without dense vertex-level interaction. We propose Anchor-based RoPE to inject anchor geometry into attention while respecting the unordered nature of objects and anchors: object-token processing is permutation-equivariant, and the mean-pooled anchor descriptor is invariant to anchor reindexing while preserving shape extent. RigidFormer further enforces rigidity by projecting updates onto the rigid-body manifold using differentiable Kabsch alignment. On standard benchmarks, RigidFormer outperforms or matches mesh-based baselines using point inputs, runs faster, generalizes to unseen point resolutions and across datasets, and scales to 200+ objects; we also show a preliminary extension to command-conditioned articulated bodies by treating body parts as interacting object-level components.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces RigidFormer, an object-centric Transformer architecture for learning rigid-body dynamics directly from unordered point-cloud inputs. It advances each object via a small set of compact anchors, employs Anchor-Vertex Pooling to inject local vertex geometry into the anchors, uses Anchor-based RoPE to encode anchor geometry in a permutation-equivariant manner, and projects updates onto the rigid manifold via differentiable Kabsch alignment. The central empirical claims are that the model matches or exceeds mesh-based baselines on standard benchmarks while running faster, generalizes to unseen point resolutions and across datasets, and scales to scenes with 200+ objects; a preliminary extension to command-conditioned articulated bodies is also presented.
Significance. If the reported performance and generalization results hold under rigorous scrutiny, the work would be a meaningful step toward mesh-free, scalable rigid-body simulation. The combination of object-centric attention, local geometric pooling, and explicit rigidity projection offers a practical alternative to dense vertex message passing, with potential impact on robotics, graphics, and physics-based learning where point-cloud or depth-sensor data predominate. The Kabsch projection is a clean, differentiable mechanism that directly addresses manifold constraints.
major comments (3)
- [§3.2] §3.2 (Anchor-Vertex Pooling): the claim that mean-pooled local features around a small set of anchors suffice to capture the high-frequency geometric cues required for discontinuous contact forces is load-bearing for the long-horizon stability and cross-resolution generalization results. Because contacts are triggered by precise local geometry at the exact collision instant, averaging can smooth or omit the necessary discontinuities; the subsequent Kabsch projection corrects only the output state, not the learned force update. The manuscript should supply either (a) an ablation isolating pooling radius and anchor count against contact-rich test cases or (b) a quantitative analysis of force-error distribution at contact events.
- [§4] §4 (Experiments): the abstract states that RigidFormer “outperforms or matches mesh-based baselines,” yet the provided text supplies no numerical tables, baseline implementations, or error bars. Without these data it is impossible to assess whether the reported gains are robust to post-hoc hyper-parameter choices or dataset selection. The full experimental section must include (i) per-benchmark quantitative metrics with standard deviations, (ii) ablation tables isolating each proposed component, and (iii) failure-case analysis for long-horizon rollouts.
- [§3.3] §3.3 (Anchor-based RoPE): the invariance claim for the mean-pooled anchor descriptor under anchor re-indexing is stated but not formally proven. Because object-token processing must remain permutation-equivariant while the pooled descriptor must be invariant, a short derivation or explicit invariance check under anchor permutation would strengthen the architectural justification.
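The pooling concern in the major comments above is easy to probe in miniature. The sketch below is a hypothetical stand-in for Anchor-Vertex Pooling (the real module pools learned vertex features, and its neighborhood rule is not specified in the excerpt): mean-pool vertices within a radius of each anchor, which makes the smoothing effect the referee worries about directly visible as the radius grows.

```python
def anchor_vertex_pool(anchors, vertices, radius):
    """Mean-pool 2D vertex positions within `radius` of each anchor.

    Hypothetical sketch: the paper pools learned features, not raw
    positions, and its neighborhood definition may differ. Larger
    radii average away exactly the fine local geometry the referee
    flags as contact-critical.
    """
    pooled = []
    for ax, ay in anchors:
        near = [(vx, vy) for vx, vy in vertices
                if (vx - ax) ** 2 + (vy - ay) ** 2 <= radius ** 2]
        if not near:
            near = [(ax, ay)]  # fall back to the anchor itself
        pooled.append((sum(v[0] for v in near) / len(near),
                       sum(v[1] for v in near) / len(near)))
    return pooled

anchors = [(0.0, 0.0)]
vertices = [(0.05, 0.02), (-0.03, 0.0), (0.0, 0.9)]
tight = anchor_vertex_pool(anchors, vertices, 0.1)  # pools the two nearby vertices
loose = anchor_vertex_pool(anchors, vertices, 2.0)  # also absorbs the far vertex
assert tight != loose  # the descriptor shifts with pooling radius
```

An ablation of the kind the referee requests would sweep `radius` and anchor count and measure the effect on contact-rich rollouts, since the descriptor demonstrably changes with both.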
minor comments (2)
- [Abstract / §3] The abstract mentions “controllable integration step sizes” but the method section does not specify how step size is encoded or conditioned; a brief clarification would improve reproducibility.
- [Figures] Figure captions and axis labels should explicitly state the number of objects, point resolution, and integration horizon used in each rollout visualization.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We address each major comment point by point below. Revisions will be made to strengthen the manuscript where the concerns identify gaps in justification or presentation.
Point-by-point responses
-
Referee: [§3.2] §3.2 (Anchor-Vertex Pooling): the claim that mean-pooled local features around a small set of anchors suffice to capture the high-frequency geometric cues required for discontinuous contact forces is load-bearing for the long-horizon stability and cross-resolution generalization results. Because contacts are triggered by precise local geometry at the exact collision instant, averaging can smooth or omit the necessary discontinuities; the subsequent Kabsch projection corrects only the output state, not the learned force update. The manuscript should supply either (a) an ablation isolating pooling radius and anchor count against contact-rich test cases or (b) a quantitative analysis of force-error distribution at contact events.
Authors: We agree that contact discontinuities pose a significant challenge and that mean pooling could in principle attenuate high-frequency cues. Our architecture mitigates this through the combination of local pooling with Anchor-based RoPE (which preserves relative geometry) and the subsequent Kabsch projection. Nevertheless, to provide direct evidence, we will add an ablation varying pooling radius and anchor count on contact-rich subsets, together with a quantitative breakdown of force-prediction error specifically at detected contact instants. revision: yes
-
Referee: [§4] §4 (Experiments): the abstract states that RigidFormer “outperforms or matches mesh-based baselines,” yet the provided text supplies no numerical tables, baseline implementations, or error bars. Without these data it is impossible to assess whether the reported gains are robust to post-hoc hyper-parameter choices or dataset selection. The full experimental section must include (i) per-benchmark quantitative metrics with standard deviations, (ii) ablation tables isolating each proposed component, and (iii) failure-case analysis for long-horizon rollouts.
Authors: We will revise the experimental section to present all quantitative results in clearly formatted tables that include per-benchmark metrics, standard deviations across multiple random seeds, explicit baseline implementation details, component-wise ablation tables, and a dedicated subsection analyzing failure modes observed in long-horizon rollouts. revision: yes
-
Referee: [§3.3] §3.3 (Anchor-based RoPE): the invariance claim for the mean-pooled anchor descriptor under anchor re-indexing is stated but not formally proven. Because object-token processing must remain permutation-equivariant while the pooled descriptor must be invariant, a short derivation or explicit invariance check under anchor permutation would strengthen the architectural justification.
Authors: We will insert a short formal derivation in §3.3 showing that the mean operation over anchors is symmetric and therefore invariant to re-indexing, while the attention mechanism operating on object tokens remains permutation-equivariant with respect to object ordering. revision: yes
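The promised derivation is indeed short; a sketch in generic notation (the symbols $a_k$, $K$, and $\pi$ are assumed here, not the paper's):

```latex
\[
d \;=\; \frac{1}{K}\sum_{k=1}^{K} a_k
\;=\; \frac{1}{K}\sum_{k=1}^{K} a_{\pi(k)}
\qquad \text{for every permutation } \pi \in S_K,
\]
```

since addition is commutative, so the mean-pooled descriptor $d$ is unchanged by anchor re-indexing. Equivariance of the object tokens follows separately: self-attention applies one parameter-shared map to all tokens, so permuting the input tokens permutes the outputs identically.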
Circularity Check
No significant circularity; empirical architecture with independent design choices
full rationale
The paper introduces an empirical Transformer architecture for mesh-free rigid-body simulation. Its load-bearing elements (Anchor-Vertex Pooling, Anchor-based RoPE, differentiable Kabsch projection) are presented as engineering decisions motivated by the need to handle unordered point inputs, preserve local contact geometry, and enforce rigid outputs after learned updates. These choices do not reduce by construction to fitted parameters or prior self-citations; the reported performance gains are measured on external simulation benchmarks via standard supervised training. No self-definitional loops, renamed predictions, or load-bearing uniqueness theorems appear in the derivation. The central claims remain falsifiable against held-out data and baselines.
Axiom & Free-Parameter Ledger
free parameters (1)
- neural network weights and architecture hyperparameters
axioms (2)
- domain assumption: Objects remain rigid throughout the simulation
- domain assumption: Contact can be adequately captured by local anchor-vertex features without global mesh connectivity
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel (J uniquely satisfies the functional equation) · tag: unclear
unclear: the relation between the paper passage and the cited Recognition theorem is ambiguous.
We introduce RigidFormer, an object-centric Transformer-based model that learns mesh-free rigid-body dynamics with controllable integration step sizes. ... Anchor-Vertex Pooling enriches these anchors with local vertex features ... differentiable Kabsch alignment.
-
IndisputableMonolith/Foundation/AlexanderDuality.lean · alexander_duality_circle_linking (D=3 forced by linking) · tag: unclear
unclear: the relation between the paper passage and the cited Recognition theorem is ambiguous.
We propose Anchor-based RoPE to inject anchor geometry into attention ... mean-pooled anchor descriptor is invariant to anchor reindexing
-
IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · tag: unclear
unclear: the relation between the paper passage and the cited Recognition theorem is ambiguous.
On standard benchmarks, RigidFormer outperforms or matches mesh-based baselines using point inputs ... scales to 200+ objects
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
- [1] Kelsey R Allen, Yulia Rubanova, Tatiana Lopez-Guevara, William Whitney, Alvaro Sanchez-Gonzalez, Peter Battaglia, and Tobias Pfaff. Learning rigid dynamics with face interaction graph networks. arXiv preprint arXiv:2212.03574, 2022.
- [2] Genesis Authors. Genesis: A universal and generative physics engine for robotics and beyond. URL https://github.com/Genesis-Embodied-AI/Genesis, 2024.
- [3] Peter Battaglia, Razvan Pascanu, Matthew Lai, Danilo Jimenez Rezende, et al. Interaction networks for learning about objects, relations and physics. Advances in Neural Information Processing Systems, 29, 2016.
- [4] Romain Brégier. Deep regression on manifolds: a 3d rotation case study. In 2021 International Conference on 3D Vision (3DV), pages 166–174. IEEE, 2021.
- [5] Arunkumar Byravan and Dieter Fox. SE3-Nets: Learning rigid body motion using deep neural networks. In 2017 IEEE International Conference on Robotics and Automation (ICRA), pages 173–180. IEEE, 2017. doi: 10.1109/ICRA.2017.7989023.
- [6] Michael B Chang, Tomer Ullman, Antonio Torralba, and Joshua B Tenenbaum. A compositional object-based approach to learning physical dynamics. arXiv preprint arXiv:1612.00341, 2016.
- [7] Hsiao-yu Chen, Edith Tretschk, Tuur Stuyck, Petr Kadlecek, Ladislav Kavan, Etienne Vouga, and Christoph Lassner. Virtual elastic objects. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 15827–15837, 2022.
- [8] Ricky TQ Chen, Brandon Amos, and Maximilian Nickel. Learning neural event functions for ordinary differential equations. arXiv preprint arXiv:2011.03902, 2020.
- [9] Erwin Coumans and Yunfei Bai. Pybullet, a python module for physics simulation for games, robotics and machine learning, 2016.
- [10] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, 2019.
- [11] C Daniel Freeman, Erik Frey, Anton Raichuk, Sertan Girgin, Igor Mordatch, and Olivier Bachem. Brax: a differentiable physics engine for large scale rigid body simulation. arXiv preprint arXiv:2106.13281, 2021.
- [12] Ross Girshick. Fast r-cnn. In Proceedings of the IEEE International Conference on Computer Vision, pages 1440–1448, 2015.
- [13] Teofilo F Gonzalez. Clustering to minimize the maximum intercluster distance. Theoretical Computer Science, 38:293–306, 1985.
- [14] Klaus Greff, Francois Belletti, Lucas Beyer, Carl Doersch, Yilun Du, Daniel Duckworth, David J. Fleet, Dan Gnanapragasam, Florian Golemo, Charles Herrmann, Thomas Kipf, Abhijit Kundu, Dmitry Lagun, Issam Laradji, Hsueh-Ti (Derek) Liu, Henning Meyer, Yishu Miao, Derek Nowrouzezahrai, Cengiz Oztireli, Etienne Pot, Noha Radwan, Daniel Rebain, Sara Sabour, Me... 2022.
- [15] Byeongho Heo, Song Park, Dongyoon Han, and Sangdoo Yun. Rotary position embedding for vision transformer. In European Conference on Computer Vision, pages 289–305. Springer, 2024.
- [16] Yuanming Hu, Luke Anderson, Tzu-Mao Li, Qi Sun, Nathan Carr, Jonathan Ragan-Kelley, and Frédo Durand. Difftaichi: Differentiable programming for physical simulation. arXiv preprint arXiv:1910.00935, 2019.
- [17] Wenlong Huang, Yu-Wei Chao, Arsalan Mousavian, Ming-Yu Liu, Dieter Fox, Kaichun Mo, and Li Fei-Fei. Pointworld: Scaling 3d world models for in-the-wild robotic manipulation. arXiv preprint arXiv:2601.03782, 2026.
- [18] Wolfgang Kabsch. A solution for the best rotation to relate two sets of vectors. Foundations of Crystallography, 32(5):922–923, 1976.
- [19] Chanho Kim and Li Fuxin. Object dynamics modeling with hierarchical point cloud-based representations. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 20977–20986, 2024.
- [20] Kimi Team, Guangyu Chen, Yu Zhang, Jianlin Su, Weixin Xu, Siyuan Pan, Yaoyu Wang, Yucheng Wang, Guanduo Chen, Bohong Yin, Yutian Chen, Junjie Yan, Ming Wei, Y. Zhang, Fanqing Meng, Chao Hong, Xiaotong Xie, Shaowei Liu, Enzhe Lu, Yunpeng Tai, Yanru Chen, Xin Men, Haiqing Guo, Y. Charles, Haoyu Lu, Lin Sui, Jinguo Zhu, Zaida Zhou, Weiran He, Weixiao Huang... arXiv preprint arXiv:2603.15031, 2026.
- [21] Jiahui Lei, Yijia Weng, Adam W Harley, Leonidas Guibas, and Kostas Daniilidis. Mosca: Dynamic gaussian fusion from casual videos via 4d motion scaffolds. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 6165–6177, 2025.
- [22] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In International Conference on Learning Representations, 2019.
- [23] Miles Macklin. Warp: A high-performance python framework for gpu simulation and graphics. In NVIDIA GPU Technology Conference (GTC), volume 3, 2022.
- [24] Viktor Makoviychuk, Lukasz Wawrzyniak, Yunrong Guo, Michelle Lu, Kier Storey, Miles Macklin, David Hoeller, Nikita Rudin, Arthur Allshire, Ankur Handa, et al. Isaac Gym: High performance GPU-based physics simulation for robot learning. arXiv preprint arXiv:2108.10470, 2021.
- [25] Xue Bin Peng. Mimickit: A reinforcement learning framework for motion imitation and control. arXiv preprint arXiv:2510.13794, 2025.
- [26] Xue Bin Peng, Ze Ma, Pieter Abbeel, Sergey Levine, and Angjoo Kanazawa. Amp: Adversarial motion priors for stylized physics-based character control. ACM Transactions on Graphics (ToG), 40(4):1–20, 2021.
- [27] Xue Bin Peng, Yunrong Guo, Lina Halper, Sergey Levine, and Sanja Fidler. Ase: Large-scale reusable adversarial skill embeddings for physically simulated characters. ACM Transactions on Graphics (TOG), 41(4):1–17, 2022.
- [28] Ethan Perez, Florian Strub, Harm de Vries, Vincent Dumoulin, and Aaron Courville. FiLM: Visual reasoning with a general conditioning layer. In Proceedings of the AAAI Conference on Artificial Intelligence, 2018.
- [29] Tobias Pfaff, Meire Fortunato, Alvaro Sanchez-Gonzalez, and Peter Battaglia. Learning mesh-based simulation with graph networks. In International Conference on Learning Representations, 2020.
- [30] Charles R Qi, Hao Su, Kaichun Mo, and Leonidas J Guibas. Pointnet: Deep learning on point sets for 3d classification and segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 652–660, 2017.
- [31] Charles Ruizhongtai Qi, Li Yi, Hao Su, and Leonidas J Guibas. Pointnet++: Deep hierarchical feature learning on point sets in a metric space. Advances in Neural Information Processing Systems, 30, 2017.
- [32] Zihan Qiu, Zekun Wang, Bo Zheng, Zeyu Huang, Kaiyue Wen, Songlin Yang, Rui Men, Le Yu, Fei Huang, Suozhi Huang, et al. Gated attention for large language models: Non-linearity, sparsity, and attention-sink-free. arXiv preprint arXiv:2505.06708, 2025.
- [33] Yulia Rubanova, Tatiana Lopez-Guevara, Kelsey R Allen, William F Whitney, Kimberly Stachenfeld, and Tobias Pfaff. Learning rigid-body simulators over implicit shapes for large-scale scenes and vision. Advances in Neural Information Processing Systems, 37:125809–125838, 2024.
- [34] Jianlin Su, Murtadha Ahmed, Yu Lu, Shengfeng Pan, Wen Bo, and Yunfeng Liu. Roformer: Enhanced transformer with rotary position embedding. Neurocomputing, 568:127063, 2024.
- [35] Emanuel Todorov, Tom Erez, and Yuval Tassa. Mujoco: A physics engine for model-based control. In 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 5026–5033. IEEE, 2012.
- [36] Arash Vahdat, Francis Williams, Zan Gojcic, Or Litany, Sanja Fidler, Karsten Kreis, et al. Lion: Latent point diffusion models for 3d shape generation. Advances in Neural Information Processing Systems, 35:10021–10039, 2022.
- [37] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in Neural Information Processing Systems, 30, 2017.
- [38] Chen Wang, Roberto Martín-Martín, Danfei Xu, Jun Lv, Cewu Lu, Li Fei-Fei, Silvio Savarese, and Yuke Zhu. 6-PACK: Category-level 6D pose tracker with anchor-based keypoints. In 2020 IEEE International Conference on Robotics and Automation (ICRA), pages 10059–10066. IEEE, 2020. doi: 10.1109/ICRA40945.2020.9196679.
- [40] Qianqian Wang, Yen-Yu Chang, Ruojin Cai, Zhengqi Li, Bharath Hariharan, Aleksander Holynski, and Noah Snavely. Tracking everything everywhere all at once. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 19795–19806, 2023.
- [41] Amaury Wei and Olga Fink. Integrating physics and topology in neural networks for learning rigid body dynamics. Nature Communications, 16(1):6867, 2025.
- [42] William F Whitney, Tatiana Lopez-Guevara, Tobias Pfaff, Yulia Rubanova, Thomas Kipf, Kimberly Stachenfeld, and Kelsey R Allen. Learning 3d particle-based simulators from rgb-d videos. arXiv preprint arXiv:2312.05359, 2023.
- [43] William F Whitney, Jacob Varley, Deepali Jain, Krzysztof Choromanski, Sumeet Singh, and Vikas Sindhwani. Modeling the real world with high-density visual particle dynamics. arXiv preprint arXiv:2406.19800, 2024.
- [44] Guandao Yang, Xun Huang, Zekun Hao, Ming-Yu Liu, Serge Belongie, and Bharath Hariharan. Pointflow: 3d point cloud generation with continuous normalizing flows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 4541–4550, 2019.
- [45] Youn-Yeol Yu, Jeongwhan Choi, Woojin Cho, Kookjin Lee, Nayong Kim, Kiseok Chang, Chang-Seung Woo, Ilho Kim, Seok-Woo Lee, Joon-Young Yang, et al. Learning flexible body collision dynamics with hierarchical contact mesh transformer. arXiv preprint arXiv:2312.12467, 2023.
- [46] Jingyang Yuan, Gongbo Sun, Zhiping Xiao, Hang Zhou, Xiao Luo, Junyu Luo, Yusheng Zhao, Wei Ju, and Ming Zhang. Egode: An event-attended graph ode framework for modeling rigid dynamics. Advances in Neural Information Processing Systems, 37:59093–59118, 2024.
- [47] Chong Zeng, Yue Dong, Pieter Peers, Hongzhi Wu, and Xin Tong. Renderformer: Transformer-based neural rendering of triangle meshes with global illumination. In Proceedings of the Special Interest Group on Computer Graphics and Interactive Techniques Conference Papers, pages 1–11, 2025.
- [48] Junyi Zhang, Charles Herrmann, Junhwa Hur, Varun Jampani, Trevor Darrell, Forrester Cole, Deqing Sun, and Ming-Hsuan Yang. Monst3r: A simple approach for estimating geometry in the presence of motion. arXiv preprint arXiv:2410.03825, 2024.
- [49] Hengshuang Zhao, Li Jiang, Jiaya Jia, Philip HS Torr, and Vladlen Koltun. Point transformer. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 16259–16268, 2021.
- [50] Haoyu Zhen, Qiao Sun, Hongxin Zhang, Junyan Li, Siyuan Zhou, Yilun Du, and Chuang Gan. Tesseract: Learning 4d embodied world models. arXiv preprint arXiv:2504.20995, 2025.
- [51] Yaofeng Desmond Zhong, Biswadip Dey, and Amit Chakraborty. Extending lagrangian and hamiltonian neural networks with differentiable contact models. Advances in Neural Information Processing Systems, 34:21910–21922, 2021.
- [52] Yi Zhou, Connelly Barnes, Jingwan Lu, Jimei Yang, and Hao Li. On the continuity of rotation representations in neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5745–5753, 2019.
- Appendix excerpts (dataset construction)
Physics Export: We run the trained policy in Isaac Gym [24] and record per-frame rigid body transforms (position and quaternion) for each body part.
Mesh Conversion: Body part transforms are applied to reference meshes from MuJoCo XML files. Meshes exceeding vertex limits undergo quadric decimation.
Temporal Subsampling: We train with step size s=10 for both ASE and G1 (3 Hz effective rate) to capture meaningful locomotion dynamics rather than high-frequency contact oscillations. Training Configuration...
discussion (0)