
arxiv: 2607.01938 · v1 · pith:UL6KRC2J · submitted 2026-07-02 · 💻 cs.RO · cs.AI · cs.CL · cs.CV · cs.LG

PhysMani: Physics-principled 3D World Model for Dynamic Object Manipulation

Pith reviewed 2026-07-03 12:01 UTC · model grok-4.3

classification 💻 cs.RO · cs.AI · cs.CL · cs.CV · cs.LG
keywords 3D Gaussian world model · divergence-free velocity field · dynamic object manipulation · future-aware policy · PhysMani-Bench · online optimization · robotics · physics-principled prediction

The pith

PhysMani couples a 3D Gaussian world model enforcing a divergence-free velocity field with a future-aware policy to improve dynamic object manipulation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces PhysMani to address manipulation of fast-moving targets in unstructured 3D scenes, where prior visual-language-action and world models lack accurate geometry and physical forecasting. It pairs a physics-principled 3D Gaussian world model that learns a divergence-free velocity field through online optimization with a policy model that attends to the predicted scene dynamics. A new benchmark, PhysMani-Bench, contains 16 tasks and shows higher success rates than baselines in both simulation and real robot trials. A sympathetic reader would care because reliable physical prediction of moving objects could make embodied systems more effective at real-time interaction without extensive retraining.

Core claim

PhysMani couples a physics-principled 3D Gaussian world model with a future-aware action policy model. The world model learns a divergence-free Gaussian velocity field via online optimization, giving fast and physically grounded predictions of future dynamics. The policy model integrates the predicted 3D scene dynamics through a learnable-token-based cross-attention module. On the newly introduced PhysMani-Bench (16 tasks), the framework achieves higher success rates than strong baselines in both simulation and real-world robot experiments.
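To make the "learnable token plus cross-attention" idea concrete, here is a minimal sketch of single-head scaled dot-product attention in which one query token attends over a set of future-dynamics tokens. It is one plausible reading, not the paper's module: the names `learnable_token` and `future_tokens`, the use of the token as the query, and the toy values are all assumptions.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(query, keys, values):
    """Single-head scaled dot-product attention for one query vector."""
    dim = len(query)
    scores = [
        sum(q * k for q, k in zip(query, key)) / math.sqrt(dim) for key in keys
    ]
    weights = softmax(scores)
    out_dim = len(values[0])
    output = [
        sum(w * value[j] for w, value in zip(weights, values))
        for j in range(out_dim)
    ]
    return output, weights


# Hypothetical learnable token (a trained parameter in a real model).
learnable_token = [1.0, 0.0]
# Hypothetical future-dynamics tokens produced by a world model.
future_tokens = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]

summary, weights = cross_attention(learnable_token, future_tokens, future_tokens)
```

The token most aligned with the query receives the largest weight, so `summary` is a weighted average of the future tokens that a downstream policy head could consume.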

What carries the argument

A divergence-free Gaussian velocity field, learned via online optimization inside a 3D Gaussian world model, which supplies physically consistent future dynamics to the policy via cross-attention.

If this is right

  • The world model supplies 3D scene futures that the policy can use directly for action selection in dynamic settings.
  • Online optimization of the velocity field allows the system to adapt predictions without full retraining on new scenes.
  • The same framework yields measurable gains on a 16-task benchmark in both simulated and physical robot settings.
  • Cross-attention integration of predicted dynamics improves handling of unstructured environments over models lacking explicit physics constraints.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The divergence-free constraint might generalize to other fluid or rigid-body prediction tasks if the online solver remains stable at higher speeds.
  • Real-world deployment could benefit from testing whether the Gaussian representation scales to cluttered scenes with partial observability.
  • Combining the velocity field with additional constraints such as conservation of momentum could further reduce prediction drift over longer horizons.

Load-bearing premise

Enforcing a divergence-free condition on the learned Gaussian velocity field through online optimization produces forecasts accurate and meaningful enough to improve policy decisions with fast-moving targets.
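For reference, the divergence-free condition can be written out as below, together with one standard example of a field that satisfies it: a rigid-motion velocity field with constant angular velocity ω and translation t. The symbols are generic textbook notation, not taken from the paper, and the paper's own enforcement mechanism is not specified in the text reviewed here.

```latex
\nabla \cdot \mathbf{v}
  = \frac{\partial v_x}{\partial x}
  + \frac{\partial v_y}{\partial y}
  + \frac{\partial v_z}{\partial z}
  = 0,
\qquad
\mathbf{v}(\mathbf{x}) = \boldsymbol{\omega} \times \mathbf{x} + \mathbf{t}
\;\Longrightarrow\;
\nabla \cdot \mathbf{v} = 0 .
```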

What would settle it

A direct comparison on PhysMani-Bench tasks with rapid motion where the full PhysMani pipeline shows no statistically significant gain in success rate over the policy model alone or over prior world-model baselines.
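One simple way to check whether a success-rate gap is statistically significant is a two-proportion z-test. The sketch below is a generic test, not the paper's evaluation protocol, and the trial counts in the example are hypothetical placeholders rather than reported results.

```python
import math


def two_proportion_z_test(successes_a, trials_a, successes_b, trials_b):
    """Two-sided z-test for the difference between two success rates."""
    rate_a = successes_a / trials_a
    rate_b = successes_b / trials_b
    pooled = (successes_a + successes_b) / (trials_a + trials_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / trials_a + 1 / trials_b))
    if std_err == 0:
        raise ValueError("pooled rate is 0 or 1; the test is undefined")
    z = (rate_a - rate_b) / std_err
    # Two-sided p-value: 2 * (1 - Phi(|z|)) = erfc(|z| / sqrt(2)).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical counts: 60/100 successes for method A, 45/100 for method B.
z_stat, p_val = two_proportion_z_test(60, 100, 45, 100)
```

With few trials per task, an exact test or a bootstrap over episodes may be a better fit than this normal approximation.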

Figures

Figures reproduced from arXiv: 2607.01938 by Bo Yang, Hao Li, Jianan Wang, Jinxi Li, Peng Yun, Shouwang Huang.

Figure 1. The overall framework of our method. It features a physics-principled 3D Gaussian world model and a future-aware action policy model.
Adjacent text from the paper: … governed by physical laws, and use those predictions to guide precise actions, all in 3D space. To accomplish such tasks, a potential strategy is to leverage existing powerful VLAs [55] or video-based world models [17] to jointly predict future states and action policies. H…
Figure 2. The left panel shows the canonical 3D Gaussian module. The top-right panel shows the physics-principled Gaussian velocity module, and the bottom-right panel illustrates the online optimization process.
Adjacent text from the paper: … physics simulators [15] to compute object-centric dynamics from known physical properties, but such approaches are typically restricted to specific object categories and lack generality across diverse objec…
Figure 3. The left block shows the encoding of visual observations and language. The right block shows the incorporation of 3D scene future dynamics for action prediction.
Adjacent text from the paper: – Step #3: For the entire 3D scene point cloud $P_t$, we similarly compute relative offsets to all $K$ neighboring Gaussians, yielding $\Delta P \in \mathbb{R}^{4096 \times K \times 3}$, and retrieve their basic velocity components, denoted by $\hat{D}_t \in \mathbb{R}^{4096 \times K \times 6}$. – Step #4: Both $\Delta P$ and D…
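The Step #3 excerpt above (relative offsets to K neighboring Gaussians plus their six velocity components) can be sketched in plain Python as follows. This is an illustration only: the brute-force neighbor search, the offset sign convention (Gaussian center minus point), and the function name `knn_offsets` are assumptions, and the tiny arrays stand in for the 4096-point cloud.

```python
def knn_offsets(points, gaussian_centers, gaussian_velocities, k):
    """Return per-point offsets to the k nearest Gaussians and their velocities.

    Output shapes are [N][k][3] for offsets and [N][k][6] for velocities.
    """
    offsets, velocities = [], []
    for point in points:
        # Rank Gaussian indices by squared Euclidean distance to the point.
        nearest = sorted(
            range(len(gaussian_centers)),
            key=lambda i: sum(
                (c - x) ** 2 for c, x in zip(gaussian_centers[i], point)
            ),
        )[:k]
        offsets.append(
            [[c - x for c, x in zip(gaussian_centers[i], point)] for i in nearest]
        )
        velocities.append([gaussian_velocities[i] for i in nearest])
    return offsets, velocities


# Hypothetical toy scene: 2 points, 3 Gaussians, 6 velocity components each.
scene_points = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
centers = [(0.1, 0.0, 0.0), (0.9, 0.0, 0.0), (5.0, 5.0, 5.0)]
components = [[0.0] * 6, [1.0] * 6, [2.0] * 6]

delta_p, d_hat = knn_offsets(scene_points, centers, components, k=2)
```

A real implementation would batch this on the GPU and use a spatial index rather than sorting every Gaussian for every point.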
Figure 4. Examples of diverse and challenging dynamic tasks in our PhysMani-Bench. The blue arrows illustrate the complex motions of dynamic targets.
Adjacent text from the paper: During inference, given visual observations, our framework continuously predicts the 3D scene’s future dynamics and uses them to infer future action keyposes, which are then converted into joint commands via inverse kinematics for execution. More details of the netwo…
Figure 5. Qualitative comparison of future frame prediction. Red circles highlight that our method can accurately predict the movement of dynamic targets.
Adjacent text from the paper: … our method and ManiGaussian; the other baselines in …
Figure 6. The PSNR scores at each timestep, which is the mean over the next 10 future frames.
Adjacent text from the paper: PhysMani requires 205 ms/frame, whereas FreeGave takes 607 ms/frame. For target trajectory prediction, consecutive frames are captured at 50 ms intervals, meaning the 1st, 5th, and 10th future frames translate to horizons of 50, 250, and 500 ms …
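The two quantities mentioned here, the prediction horizon per future frame and PSNR, can be sketched as below. The 50 ms frame interval comes from the quoted text; the PSNR helper is the standard definition for signals scaled to `[0, max_value]`, and the function names and flat-list image representation are assumptions made for illustration.

```python
import math

FRAME_INTERVAL_MS = 50  # from the quoted text: frames are captured 50 ms apart


def horizon_ms(frame_index):
    """Prediction horizon in milliseconds for the n-th future frame."""
    return frame_index * FRAME_INTERVAL_MS


def psnr(predicted, target, max_value=1.0):
    """Peak signal-to-noise ratio in dB between two flat pixel lists."""
    mse = sum((p - t) ** 2 for p, t in zip(predicted, target)) / len(target)
    if mse == 0:
        return float("inf")  # identical images
    return 10 * math.log10(max_value ** 2 / mse)


horizons = [horizon_ms(k) for k in (1, 5, 10)]
example_psnr = psnr([0.5, 0.5], [0.6, 0.4])
```

The per-frame timings quoted above (205 ms versus 607 ms) correspond to a ratio of roughly 607 / 205 ≈ 2.96.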
Figure 7. Illustration of the physical robot setup and real-world dynamic tasks. The robot is Astribot S1 [18].
Adjacent text from the paper: … of 0.008, 0.039, and 0.074 m across these timesteps. In contrast, ManiGaussian produces a substantial error of 0.388 m after just a single 50 ms step …
Figure 8. Mean SR of PhysMani over all 16 tasks under different world model optimization iterations T.
[Plot residue: x-axis T with ticks 10, 20, 30, 40, 50; y-axis Mean SR with ticks 42 to 50; data labels in extraction order: 45.8 ± 0.6, 44.7 ± 1.7, 45.0 ± 0.9, 46.0 ± 0.7, 45.4 ± 0.9]
Adjacent text from the paper: – placing a toy onto a moving belt: The robot must pick up a toy such as a plastic onion and lemon, and then place it onto a moving belt. – placing a cube onto a rotating rack: The robot must pick up a cube and then plac…
Figure 9. Visualizations of the learned six basic velocity components.
Adjacent text from the paper: Note that, if we entirely remove our 3D world model and the incorporation module, the resulting framework is exactly the 3DFA backbone, which already performs worse than our method, as shown in Tables 1&3. Therefore, there is no need to conduct a separate ablation experiment. (2) Removing the design of learnable token L: In Step #5 of incorporat…
Figure 10. The network structure of $f_{vel}$. MLP(51, 128) means a multi-layer perceptron with 51 input features and 128 hidden features.
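A 51-feature input is what a NeRF-style positional embedding produces from a 3D coordinate when the raw coordinates are kept and eight sine/cosine frequency bands are added (3 + 3 × 2 × 8 = 51). Whether the paper uses exactly this encoding is an assumption; the sketch below only shows one common construction that matches the stated input width.

```python
import math


def positional_embedding(coords, num_freqs=8):
    """Raw coordinates followed by sin/cos features at octave frequencies."""
    features = list(coords)
    for level in range(num_freqs):
        freq = 2.0 ** level
        for fn in (math.sin, math.cos):
            features.extend(fn(freq * value) for value in coords)
    return features


embedded = positional_embedding([0.1, 0.2, 0.3])
origin_embedded = positional_embedding([0.0, 0.0, 0.0])
```

At the origin every sine feature is 0 and every cosine feature is 1, which gives a quick sanity check on the layout.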
Figure 11. The network to process $\Delta P$ and $\hat{D}_t$ to obtain future-aware tokens $A_t$.
Adjacent text from the paper: Physics-principled Gaussian Velocity Module: The network structure of $f_{vel}$ follows FreeGave [37], which contains two parts: $f_{code}$ and $f_{weight}$. Figure 10 illustrates the details of $f_{vel}$. More details can be found in FreeGave. Given new observations at current time t, we update $f_{vel}$ for 50 iterations (freezing all Gaussian parameters) and …
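The online update described here (a fixed number of iterations on the velocity model while the Gaussian parameters stay frozen) can be illustrated with a toy stand-in. The sketch below replaces the velocity network with a single shared constant velocity and fits it by plain gradient descent; the loss, the learning rate, and the function name `online_update` are assumptions chosen for illustration, and only the 50-iteration budget comes from the quoted text.

```python
def online_update(velocity, positions_prev, positions_now, dt,
                  iterations=50, lr=100.0):
    """Refine a shared constant velocity; point positions are never modified.

    Minimizes the mean of ||p_prev + v * dt - p_now||^2 over all points.
    The default lr is tuned for dt = 0.05 and is a hypothetical choice.
    """
    v = list(velocity)
    count = len(positions_prev)
    for _ in range(iterations):
        grad = [0.0, 0.0, 0.0]
        for prev, now in zip(positions_prev, positions_now):
            for axis in range(3):
                residual = prev[axis] + v[axis] * dt - now[axis]
                grad[axis] += 2.0 * residual * dt / count
        v = [v[axis] - lr * grad[axis] for axis in range(3)]
    return v


# Hypothetical observations: two points moving with a known velocity.
true_velocity = (2.0, -1.0, 0.5)
step = 0.05  # seconds, matching a 50 ms frame interval
previous = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
current = [
    tuple(p + tv * step for p, tv in zip(point, true_velocity))
    for point in previous
]

estimated = online_update([0.0, 0.0, 0.0], previous, current, step)
```

Because only `v` is updated, the cost per new observation stays small, which is the property an online scheme of this kind relies on.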
Figure 12. Detailed physical parameters of our PhysMani-bench tasks. For example, the movement range "basketball: x: [0.0, 0.3], y: [-0.5, 0.5], z: 0.8" indicates that the basketball’s motion is restricted to a rectangular region in the xy-plane which spans from 0.0 ∼ 0.3 m along the x-axis and −0.5 ∼ 0.5 m along the y-axis at a fixed height of z = 0.8 m.
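The movement-range notation in this caption maps naturally onto a small containment check. The dictionary layout and the helper name `in_range` below are assumptions made for illustration; only the basketball bounds are taken from the caption.

```python
def in_range(position, movement_range, tolerance=1e-9):
    """Check an (x, y, z) position against per-axis bounds.

    A bound is either a [low, high] interval or a single fixed value.
    """
    for axis, value in zip("xyz", position):
        bound = movement_range[axis]
        if isinstance(bound, (list, tuple)):
            low, high = bound
            if not (low <= value <= high):
                return False
        elif abs(value - bound) > tolerance:
            return False
    return True


# Bounds quoted in the caption, in meters.
basketball_range = {"x": [0.0, 0.3], "y": [-0.5, 0.5], "z": 0.8}

inside = in_range((0.1, 0.0, 0.8), basketball_range)
outside_x = in_range((0.4, 0.0, 0.8), basketball_range)
wrong_height = in_range((0.1, 0.0, 0.9), basketball_range)
```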
Original abstract

Manipulating fast and dynamically moving targets in unstructured 3D environments remains challenging for embodied AI. Existing visual-language-action models and world models struggle with accurate 3D geometry and physically meaningful forecasting. We propose PhysMani, a framework that couples a physics-principled 3D Gaussian world model with a future-aware action policy model. The world model learns a divergence-free Gaussian velocity field via online optimization for fast and physically grounded future dynamics prediction. The policy model integrates the predicted 3D scene future dynamics through a learnable token based cross-attention module. We introduce PhysMani-Bench, a dynamic manipulation benchmark with 16 tasks, and demonstrate a superior success rate over strong baselines in both simulation and real-world robot experiments.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper claims to introduce PhysMani, which couples a physics-principled 3D Gaussian world model learning a divergence-free Gaussian velocity field via online optimization for future dynamics prediction with a future-aware action policy model using cross-attention. It introduces PhysMani-Bench with 16 tasks and demonstrates superior success rates over strong baselines in simulation and real-world robot experiments.

Significance. If the result holds and the physics constraint is shown to be responsible, it could advance the field by providing physically grounded world models for dynamic manipulation tasks. However, the current presentation does not allow assessment of whether the divergence-free condition improves forecasts for rigid-body dynamics under contacts and gravity.

major comments (2)
  1. [Abstract] The abstract asserts superior success rates but supplies no metrics, baseline descriptions, error analysis, or validation that the divergence-free constraint actually drives the gains rather than other modeling choices.
  2. The central claim requires that the learned velocity field (enforced ∇·v=0) produces forecasts accurate enough to improve the cross-attention policy over baselines. This condition is appropriate for incompressible flow but is only an approximation for the rigid, colliding, and gravity-driven objects in PhysMani-Bench; without specific results showing reduced prediction error attributable to this constraint, the success-rate gains cannot be attributed to the physics principle.
minor comments (1)
  1. The manuscript would benefit from including quantitative tables with success rates, prediction errors, and ablation studies on the divergence-free constraint.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address the concerns regarding the abstract and the attribution of gains to the divergence-free constraint below, and will incorporate revisions to strengthen the manuscript.

Point-by-point responses
  1. Referee: [Abstract] The abstract asserts superior success rates but supplies no metrics, baseline descriptions, error analysis, or validation that the divergence-free constraint actually drives the gains rather than other modeling choices.

    Authors: We agree that the abstract would be strengthened by including quantitative metrics and baseline details. In the revised version, we will update the abstract to report specific success rates on PhysMani-Bench (e.g., average improvement over baselines), briefly describe the 16 tasks, and note the role of the physics constraint. We will also add a short reference to the ablation results validating the constraint's contribution. revision: yes

  2. Referee: The central claim requires that the learned velocity field (enforced ∇·v=0) produces forecasts accurate enough to improve the cross-attention policy over baselines. This condition is appropriate for incompressible flow but is only an approximation for the rigid, colliding, and gravity-driven objects in PhysMani-Bench; without specific results showing reduced prediction error attributable to this constraint, the success-rate gains cannot be attributed to the physics principle.

    Authors: We acknowledge that the divergence-free constraint is an approximation for rigid-body dynamics involving contacts and gravity, rather than a perfect model of incompressible flow. The manuscript presents the constraint as a useful inductive bias for stable velocity field learning in 3D Gaussians. To directly address attribution, we will add ablation experiments in the revision that compare prediction error (e.g., mean squared velocity error and divergence metrics) and downstream success rates with and without the ∇·v=0 enforcement. This will provide evidence on whether the constraint reduces forecast error and drives policy improvements. We maintain that the online optimization with this constraint yields more physically grounded predictions than unconstrained alternatives in our tasks. revision: yes
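A divergence metric of the kind mentioned in this response can be estimated with central finite differences. The sketch below is a generic numerical check, not the authors' evaluation code: the two test fields, the step size `h`, and all function names are assumptions. It also illustrates a textbook fact relevant to the exchange: a rigid-motion field is divergence-free, while a uniformly expanding field is not.

```python
def divergence(field, point, h=1e-4):
    """Central-difference estimate of the divergence of a 3D vector field."""
    total = 0.0
    for axis in range(3):
        plus = list(point)
        minus = list(point)
        plus[axis] += h
        minus[axis] -= h
        total += (field(plus)[axis] - field(minus)[axis]) / (2 * h)
    return total


def rigid_field(x, omega=(0.0, 0.0, 1.0), t=(0.2, 0.0, 0.0)):
    """Rigid motion: v(x) = omega cross x + t."""
    return [
        omega[1] * x[2] - omega[2] * x[1] + t[0],
        omega[2] * x[0] - omega[0] * x[2] + t[1],
        omega[0] * x[1] - omega[1] * x[0] + t[2],
    ]


def expanding_field(x, rate=0.5):
    """Uniform expansion: v(x) = rate * x, with divergence 3 * rate."""
    return [rate * c for c in x]


def mean_abs_divergence(field, points):
    """Average |div v| over sample points; 0 for a divergence-free field."""
    return sum(abs(divergence(field, p)) for p in points) / len(points)


samples = [(0.1, 0.2, 0.3), (1.0, -0.5, 0.25)]
rigid_score = mean_abs_divergence(rigid_field, samples)
expanding_score = mean_abs_divergence(expanding_field, samples)
```

For a learned field, the same metric could be evaluated at the Gaussian centers, with automatic differentiation replacing the finite differences.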

Circularity Check

0 steps flagged

No significant circularity in derivation chain.

Full rationale

The abstract and provided text describe a new coupling of a 3D Gaussian world model (with divergence-free velocity field learned via online optimization) to a cross-attention policy. No equations, self-citations, or steps are exhibited that reduce the claimed 'physically grounded future dynamics prediction' to a fitted input or prior result by construction. The optimization and policy integration are presented as independent methodological contributions rather than tautological renamings or self-referential definitions. The derivation chain remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review yields no identifiable free parameters, axioms, or invented entities; the divergence-free condition is presented as a physics principle but its precise mathematical enforcement is not specified.



Reference graph

Works this paper leans on

71 extracted references · 23 canonical work pages · 13 internal anchors

  1. [1]

    GPT-4 Technical Report

    Achiam, J., Adler, S., Agarwal, S., Ahmad, L., Akkaya, I., Aleman, F.L., Almeida, D., Altenschmidt, J., Altman, S., Anadkat, S., et al.: GPT-4 technical report. arXiv preprint arXiv:2303.08774 (2023)

  2. [2]

    Cosmos World Foundation Model Platform for Physical AI

    Agarwal, N., Ali, A., Bala, M., Balaji, Y., Barker, E., Cai, T., Chattopadhyay, P., Chen, Y., Cui, Y., Ding, Y., et al.: Cosmos world foundation model platform for physical AI. arXiv preprint arXiv:2501.03575 (2025)

  3. [3]

    In: IEEE/RSJ International Conference on Intelligent Robots and Systems

    Akinola, I., Xu, J., Song, S., Allen, P.K.: Dynamic grasping with reachability and motion awareness. In: IEEE/RSJ International Conference on Intelligent Robots and Systems. pp. 9422–9429 (2021)

  4. [4]

    World Simulation with Video Foundation Models for Physical AI

    Ali, A., Bai, J., Bala, M., Balaji, Y., Blakeman, A., Cai, T., Cao, J., Cao, T., Cha, E., Chao, Y.W., et al.: World simulation with video foundation models for physical AI. arXiv preprint arXiv:2511.00062 (2025)

  5. [5]

    V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning

    Assran, M., Bardes, A., Fan, D., Garrido, Q., Howes, R., Muckley, M., Rizvi, A., Roberts, C., Sinha, K., Zholus, A., et al.: V-JEPA 2: Self-supervised video models enable understanding, prediction and planning. arXiv preprint arXiv:2506.09985 (2025)

  6. [6]

    In: SIGGRAPH Asia 2024 Conference Papers

    Bar-Tal, O., Chefer, H., Tov, O., Herrmann, C., Paiss, R., Zada, S., Ephrat, A., Hur, J., Liu, G., Raj, A., et al.: Lumiere: A space-time diffusion model for video generation. In: SIGGRAPH Asia 2024 Conference Papers. pp. 1–11 (2024)

  7. [7]

    In: International Conference on Learning Representations

    Barcellona, L., Zadaianchuk, A., Allegro, D., Papa, S., Ghidoni, S., Gavves, E.: Dream to manipulate: Compositional world models empowering robot imitation learning with imagination. In: International Conference on Learning Representations. vol. 2025, pp. 56729–56763 (2025)

  8. [8]

    In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

    Bi, H., Tan, H., Xie, S., Wang, Z., Huang, S., Liu, H., Zhao, R., Feng, Y., Xiang, C., Rong, Y., et al.: Motus: A unified latent action world model. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 35101–35113 (2026)

  9. [9]

    In: Conference on Robot Learning (2025)

    Black, K., Brown, N., Darpinian, J., Dhabalia, K., Driess, D., Esmail, A., Equi, M.R., Finn, C., Fusai, N., Galliker, M.Y., et al.: pi0.5: a vision-language-action model with open-world generalization. In: Conference on Robot Learning (2025)

  10. [10]

    In: Proceedings of Robotics: Science and Systems

    Black, K., Brown, N., Driess, D., Esmail, A., Equi, M.R., Finn, C., Fusai, N., Groom, L., Hausman, K., Ichter, B., Jakubczak, S., Jones, T., Ke, L., Levine, S., Li-Bell, A., Mothukuri, M., Nair, S., Pertsch, K., Shi, L.X., Smith, L., Tanner, J., Vuong, Q., Walling, A., Wang, H., Zhilinsky, U.: pi0: A Vision-Language-Action Flow Model for General Robot Con...

  11. [11]

    Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets

    Blattmann, A., Dockhorn, T., Kulal, S., Mendelevitch, D., Kilian, M., Lorenz, D., Levi, Y., English, Z., Voleti, V., Letts, A., et al.: Stable video diffusion: Scaling latent video diffusion models to large datasets. arXiv preprint arXiv:2311.15127 (2023)

  12. [12]

    In: International Conference on Machine Learning (2024)

    Bruce, J., Dennis, M.D., Edwards, A., Parker-Holder, J., Shi, Y., Hughes, E., Lai, M., Mavalankar, A., Steigerwald, R., Apps, C., et al.: Genie: Generative interactive environments. In: International Conference on Machine Learning (2024)

  13. [13]

    arXiv preprint arXiv:2601.00051 (2026)

    Chen, Y., Liang, Y., Wang, J., Chen, T., Cheng, J., Gu, Z., Huang, Y., Jiang, Z., Li, W., Li, T., et al.: TeleWorld: Towards dynamic multimodal synthesis with a 4d world model. arXiv preprint arXiv:2601.00051 (2026)

  14. [14]

    In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

    Christen, S., Yang, W., Pérez-D’Arpino, C., Hilliges, O., Fox, D., Chao, Y.W.: Learning human-to-robot handovers from point clouds. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 9654–9664 (2023)

  15. [15]

    Coumans, E., Bai, Y.: PyBullet, a Python module for physics simulation for games, robotics and machine learning. http://pybullet.org (2016–2021), last accessed 28 Jun 2026

  16. [16]

    In: IEEE International Conference on Robotics and Automation

    D’Ambrosio, D.B., Abeyruwan, S., Graesser, L., Iscen, A., Amor, H.B., Bewley, A., Reed, B.J., Reymann, K., Takayama, L., Tassa, Y., et al.: Achieving human level competitive robot table tennis. In: IEEE International Conference on Robotics and Automation. pp. 74–82 (2025)

  17. [17]

    ACM Computing Surveys 58(3), 1–38 (2025)

    Ding, J., Zhang, Y., Shang, Y., Zhang, Y., Zong, Z., Feng, J., Yuan, Y., Su, H., Li, N., Sukiennik, N., et al.: Understanding world or predicting future? a comprehensive survey of world models. ACM Computing Surveys 58(3), 1–38 (2025)

  18. [18]

    arXiv preprint arXiv:2507.17141 (2025)

    Gao, G., Wang, J., Zuo, J., Jiang, J., Zhang, J., Zeng, X., Zhu, Y., Ma, L., Chen, K., Sheng, M., et al.: Towards human-level intelligence via human-like whole-body manipulation. arXiv preprint arXiv:2507.17141 (2025)

  19. [19]

    In: Conference on Robot Learning (2023)

    Gervet, T., Xian, Z., Gkanatsios, N., Fragkiadaki, K.: Act3D: 3D feature field transformers for multi-task robotic manipulation. In: Conference on Robot Learning (2023)

  20. [20]

    arXiv preprint arXiv:2508.11002 (2025)

    Gkanatsios, N., Xu, J., Bronars, M., Mousavian, A., Ke, T.W., Fragkiadaki, K.: 3D FlowMatch Actor: Unified 3D policy for single-and dual-arm manipulation. arXiv preprint arXiv:2508.11002 (2025)

  21. [21]

    In: Conference on Robot Learning

    Goyal, A., Xu, J., Guo, Y., Blukis, V., Chao, Y.W., Fox, D.: RVT: Robotic view transformer for 3D object manipulation. In: Conference on Robot Learning. pp. 694–710 (2023)

  22. [22]

    The Llama 3 Herd of Models

    Grattafiori, A., Dubey, A., Jauhri, A., Pandey, A., Kadian, A., Al-Dahle, A., Letman, A., Mathur, A., Schelten, A., Vaughan, A., et al.: The Llama 3 herd of models. arXiv preprint arXiv:2407.21783 (2024)

  23. [23]

    World Models

    Ha, D., Schmidhuber, J.: World models. arXiv preprint arXiv:1803.10122 (2018)

  24. [24]

    Nature 640(8059), 647–653 (2025)

    Hafner, D., Pasukonis, J., Ba, J., Lillicrap, T.: Mastering diverse control tasks through world models. Nature 640(8059), 647–653 (2025)

  25. [25]

    In: Advances in Neural Information Processing Systems

    Ho, J., Jain, A., Abbeel, P.: Denoising diffusion probabilistic models. In: Advances in Neural Information Processing Systems. NIPS ’20, Curran Associates Inc., Red Hook, NY, USA (2020)

  26. [26]

    In: International Conference on Learning Representations (2022)

    Hu, E.J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., Chen, W.: LoRA: Low-rank adaptation of large language models. In: International Conference on Learning Representations (2022)

  27. [27]

    IEEE Robotics and Automation Letters 5(2), 3019–3026 (2020)

    James, S., Ma, Z., Arrojo, D.R., Davison, A.J.: RLBench: The robot learning benchmark & learning environment. IEEE Robotics and Automation Letters 5(2), 3019–3026 (2020)

  28. [28]

    In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

    James, S., Wada, K., Laidlow, T., Davison, A.J.: Coarse-to-fine q-attention: Efficient learning for visual robotic manipulation via discretisation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 13739–13748 (2022)

  29. [29]

    IEEE Access (2025)

    Kawaharazuka, K., Oh, J., Yamada, J., Posner, I., Zhu, Y.: Vision-language-action models for robotics: A review towards real-world applications. IEEE Access (2025)

  30. [30]

    In: Conference on Robot Learning (2024)

    Ke, T.W., Gkanatsios, N., Fragkiadaki, K.: 3D Diffuser Actor: Policy diffusion with 3D scene representations. In: Conference on Robot Learning (2024)

  31. [31]

    ACM Transactions on Graphics 42(4) (Jul 2023)

    Kerbl, B., Kopanas, G., Leimkuehler, T., Drettakis, G.: 3D Gaussian splatting for real-time radiance field rendering. ACM Transactions on Graphics 42(4) (Jul 2023)

  32. [32]

    In: Conference on Robot Learning (2024)

    Kim, M.J., Pertsch, K., Karamcheti, S., Xiao, T., Balakrishna, A., Nair, S., Rafailov, R., Foster, E.P., Sanketi, P.R., Vuong, Q., et al.: OpenVLA: An open-source vision-language-action model. In: Conference on Robot Learning (2024)

  33. [33]

    In: International Conference on Autonomous Agents

    Kitano, H., Asada, M., Kuniyoshi, Y., Noda, I., Osawa, E.: Robocup: The robot world cup initiative. In: International Conference on Autonomous Agents. pp. 340–347 (1997)

  34. [34]

    HunyuanVideo: A Systematic Framework For Large Video Generative Models

    Kong, W., Tian, Q., Zhang, Z., Min, R., Dai, Z., Zhou, J., Xiong, J., Li, X., Wu, B., Zhang, J., et al.: HunyuanVideo: A systematic framework for large video generative models. arXiv preprint arXiv:2412.03603 (2024)

  35. [35]

    In: Thirty-seventh Conference on Neural Information Processing Systems (2023)

    Li, J., Song, Z., Yang, B.: NVFi: Neural velocity fields for 3d physics learning from dynamic videos. In: Thirty-seventh Conference on Neural Information Processing Systems (2023)

  36. [36]

    In: Proceedings of the IEEE/CVF International Conference on Computer Vision

    Li, J., Song, Z., Yang, B.: TRACE: Learning 3D Gaussian physical dynamics from multi-view videos. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 8820–8829 (2025)

  37. [37]

    In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

    Li, J., Song, Z., Zhou, S., Yang, B.: FreeGave: 3D physics learning from dynamic videos by Gaussian velocity. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 12433–12443 (2025)

  38. [38]

    arXiv preprint arXiv:2504.16693 (2025)

    Li, W., Zhao, H., Yu, Z., Du, Y., Zou, Q., Hu, R., Xu, K.: PIN-WM: Learning physics-informed world models for non-prehensile manipulation. arXiv preprint arXiv:2504.16693 (2025)

  39. [39]

    A Comprehensive Survey on World Models for Embodied AI

    Li, X., He, X., Zhang, L., Wu, M., Li, X., Liu, Y.: A comprehensive survey on world models for embodied AI. arXiv preprint arXiv:2510.16732 (2025)

  40. [40]

    In: International Conference on Learning Representations (2019)

    Li, Y., Wu, J., Tedrake, R., Tenenbaum, J.B., Torralba, A.: Learning particle dynamics for manipulating rigid bodies, deformable objects, and fluids. In: International Conference on Learning Representations (2019)

  41. [41]

    DeepSeek-V3 Technical Report

    Liu, A., Feng, B., Xue, B., Wang, B., Wu, B., Lu, C., Zhao, C., Deng, C., Zhang, C., Ruan, C., et al.: DeepSeek-V3 technical report. arXiv preprint arXiv:2412.19437 (2024)

  42. [42]

    In: Advances in Neural Information Processing Systems

    Liu, B., Zhu, Y., Gao, C., Feng, Y., Liu, Q., Zhu, Y., Stone, P.: LIBERO: benchmarking knowledge transfer for lifelong robot learning. In: Advances in Neural Information Processing Systems. NIPS ’23, Curran Associates Inc., Red Hook, NY, USA (2023)

  43. [43]

    arXiv preprint arXiv:2210.13431 (2022)

    Liu, H., Lee, L., Lee, K., Abbeel, P.: Instruction-following agents with multimodal transformer. arXiv preprint arXiv:2210.13431 (2022)

  44. [44]

    In: Oh, A., Naumann, T., Globerson, A., Saenko, K., Hardt, M., Levine, S

    Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. In: Oh, A., Naumann, T., Globerson, A., Saenko, K., Hardt, M., Levine, S. (eds.) Advances in Neural Information Processing Systems. vol. 36, pp. 34892–34916. Curran Associates, Inc. (2023)

  45. [45]

    In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

    Liu, J., Zhang, R., Fang, H.S., Gou, M., Fang, H., Wang, C., Xu, S., Yan, H., Lu, C.: Target-referenced reactive grasping for dynamic objects. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 8824–8833 (2023)

  46. [46]

    In: International Conference on Learning Representations (2023)

    Liu, X., Gong, C., et al.: Flow straight and fast: Learning to generate and transfer data with rectified flow. In: International Conference on Learning Representations (2023)

  47. [47]

    In: European Conference on Computer Vision

    Lu, G., Zhang, S., Wang, Z., Liu, C., Lu, J., Tang, Y.: ManiGaussian: Dynamic Gaussian splatting for multi-task robotic manipulation. In: European Conference on Computer Vision. pp. 349–366 (2024)

  48. [48]

    IEEE Transactions on Neural Networks and Learning Systems (2026)

    Ma, Y., Song, Z., Zhuang, Y., Hao, J., King, I.: A survey on vision–language–action models for embodied AI. IEEE Transactions on Neural Networks and Learning Systems (2026)

  49. [49]

    Autonomous Robots 43(5), 1241–1256 (2019)

    Marturi, N., Kopicki, M., Rastegarpanah, A., Rajasekaran, V., Adjigble, M., Stolkin, R., Leonardis, A., Bekiroglu, Y.: Dynamic grasp and trajectory planning for moving objects. Autonomous Robots 43(5), 1241–1256 (2019)

  50. [50]

    IEEE Robotics and Automation Letters 7(3), 7327–7334 (2022)

    Mees, O., Hermann, L., Rosete-Beas, E., Burgard, W.: CALVIN: A benchmark for language-conditioned policy learning for long-horizon robot manipulation tasks. IEEE Robotics and Automation Letters 7(3), 7327–7334 (2022)

  51. [51]

    Motamed, S., Culp, L., Swersky, K., Jaini, P., Geirhos, R.: Do generative video models understand physical principles? In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision. pp. 948–958 (2026)

  52. [52]

    In: Conference on Robot Learning

    Noh, D., Kong, D., Zhao, M., Lizarraga, A., Xie, J., Wu, Y.N., Hong, D.: Latent adaptive planner for dynamic manipulation. In: Conference on Robot Learning. pp. 2430–2448 (2025)

  53. [53]

    In: Proceedings of Robotics: Science and Systems

    Octo Model Team, Ghosh, D., Walke, H., Pertsch, K., Black, K., Mees, O., Dasari, S., Hejna, J., Xu, C., Luo, J., Kreiman, T., Tan, Y., Sanketi, P., Vuong, Q., Xiao, T., Sadigh, D., Finn, C., Levine, S.: Octo: An open-source generalist robot policy. In: Proceedings of Robotics: Science and Systems. Delft, Netherlands (2024)

  54. [54]

    Qwen2.5-VL Technical Report

    Qwen Team: Qwen2.5-VL Technical Report. arXiv preprint arXiv:2502.13923 (2025)

  55. [55]

    arXiv preprint arXiv:2505.04769 (2025)

    Sapkota, R., Cao, Y., Roumeliotis, K.I., Karkee, M.: Vision-language-action models: Concepts, progress, applications and challenges. arXiv preprint arXiv:2505.04769 (2025)

  56. [56]

    In: Conference on Robot Learning

    Shridhar, M., Manuelli, L., Fox, D.: Perceiver-Actor: A multi-task transformer for robotic manipulation. In: Conference on Robot Learning. pp. 785–799 (2023)

  57. [57]

    arXiv preprint arXiv:2511.23429 (2025)

    Tang, J., Liu, J., Li, J., Wu, L., Yang, H., Zhao, P., Gong, S., Yuan, X., Shao, S., Zhang, L., et al.: Hunyuan-gamecraft-2: Instruction-following interactive game world model. arXiv preprint arXiv:2511.23429 (2025)

  58. [58]

    In: Advances in Neural Information Processing Systems

    Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, L., Polosukhin, I.: Attention is all you need. In: Advances in Neural Information Processing Systems. pp. 6000–6010. NIPS’17, Curran Associates Inc., Red Hook, NY, USA (2017)

  59. [59]

    In: Conference on Robot Learning

    Walke, H.R., Black, K., Zhao, T.Z., Vuong, Q., Zheng, C., Hansen-Estruch, P., He, A.W., Myers, V., Kim, M.J., Du, M., et al.: BridgeData v2: A dataset for robot learning at scale. In: Conference on Robot Learning. pp. 1723–1736 (2023)

  60. [60]

    Wan: Open and Advanced Large-Scale Video Generative Models

    Wan Team, Wang, A., Ai, B., Wen, B., Mao, C., Xie, C.W., Chen, D., Yu, F., Zhao, H., Yang, J., et al.: Wan: Open and advanced large-scale video generative models. arXiv preprint arXiv:2503.20314 (2025)

  61. [61]

    arXiv preprint arXiv:2601.22153 (2026)

    Xie, H., Wen, B., Zheng, J., Chen, Z., Hong, F., Diao, H., Liu, Z.: DynamicVLA: A vision-language-action model for dynamic object manipulation. arXiv preprint arXiv:2601.22153 (2026)

  62. [62]

    Qwen2.5-1M Technical Report

    Yang, A., Yu, B., Li, C., Liu, D., Huang, F., Huang, H., Jiang, J., Tu, J., Zhang, J., Zhou, J., Lin, J., Dang, K., Yang, K., Yu, L., Li, M., Sun, M., Zhu, Q., Men, R., He, T., Xu, W., Yin, W., Yu, W., Qiu, X., Ren, X., Yang, X., Li, Y., Xu, Z., Zhang, Z.: Qwen2.5-1M technical report. arXiv preprint arXiv:2501.15383 (2025)

  63. [63]

    In: International Conference on Learning Representations the 2nd Workshop on World Models: Understanding, Modelling and Scaling (2026)

    Ye, S., Ge, Y., Zheng, K., Gao, S., Yu, S., Kurian, G., Indupuru, S., Tan, Y.L., Zhu, C., Xiang, J., et al.: World action models are zero-shot policies. In: International Conference on Learning Representations the 2nd Workshop on World Models: Understanding, Modelling and Scaling (2026)

  64. [64]

    arXiv preprint arXiv:2509.19012 (2025)

    Zhang, D., Sun, J., Hu, C., Wu, X., Yuan, Z., Zhou, R., Shen, F., Zhou, Q.: Pure vision language action (VLA) models: A comprehensive survey. arXiv preprint arXiv:2509.19012 (2025)

  65. [65]

    IEEE Robotics and Automation Letters 10(6), 5209–5216 (2025)

    Zhang, Y., Wang, R., Chen, X.: Dynamic behavior cloning with temporal feature prediction: Enhancing robotic arm manipulation in moving object tasks. IEEE Robotics and Automation Letters 10(6), 5209–5216 (2025)

  66. [66]

    IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) pp

    Zhou, S., Wang, H., Cheng, H., Li, J., Wang, D., Jiang, J., Jin, Y., Huang, J., Mao, S., Liu, S., Yang, Y., Song, H., Wei, S., Zhang, Z., Wang, B., Wang, Z., Zou, C., Yang, B.: PhysInOne: Visual Physics Learning and Reasoning in One Suite. IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) pp. 33131–33142 (2026)

  67. [67]

    IEEE Transactions on Industrial Electronics 71(7), 7466–7476 (2023)

    Zhou, Y., Sun, G., Miao, Y., Zhang, Y., Chen, X., Wang, H.: Spatiotemporal optimal trajectory planning for safe planar manipulation of a moving object. IEEE Transactions on Industrial Electronics 71(7), 7466–7476 (2023)

  68. [68]

    Is Sora a world simulator? a comprehensive survey on general world models and beyond

    Zhu, Z., Wang, X., Zhao, W., Min, C., Li, B., Deng, N., Dou, M., Wang, Y., Shi, B., Wang, K., et al.: Is Sora a world simulator? a comprehensive survey on general world models and beyond. arXiv preprint arXiv:2405.03520 (2024)

  69. [69]

    In: Conference on Robot Learning

    Zitkovich, B., Yu, T., Xu, S., Xu, P., Xiao, T., Xia, F., Wu, J., Wohlhart, P., Welker, S., Wahid, A., et al.: RT-2: Vision-language-action models transfer web knowledge to robotic control. In: Conference on Robot Learning. pp. 2165–2183 (2023)

    Adjacent text from the paper: A Appendix. A.1 Details of Physics-principled 3D Gaussian World Model. Canonical 3D Gaussian Module: A...

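  Appendix excerpt (extracted as entries [70] and [71]; the leading terms of Eq. (5) are truncated in the source):

    $$\dots + \lambda_{\text{ssim}}\bigl(1 - \ell_{\text{ssim}}(I_0^c, \hat{I}_0^c)\bigr) + (1 - \lambda_{\text{ssim}})\,\ell_1(D_0^c, \hat{D}_0^c) + \lambda_{\text{ssim}}\bigl(1 - \ell_{\text{ssim}}(D_0^c, \hat{D}_0^c)\bigr), \tag{5}$$

    where $\{(\hat{I}_0^c, \hat{D}_0^c)\}_{c=1}^{C}$ are rendered RGB/Ds from canonical 3D Gaussians and $\lambda_{\text{ssim}} = 0.2$. It takes about 15s on an RTX 4090 for PhysMani-Bench.

    Network layer labels (apparently from Figure 10, flattened in extraction): $g_0$ ([1, 3]) · Positional Embedding ([1, 51]) · MLP(51, 128), ReLU · MLP(128, 128), ReLU · MLP(128, 128), ReLU · Concatenate ([N, 128+51]) · MLP(179, 128), ReLU · MLP(128, 16) · MLP(16, 64), ReLU · MLP(64,...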