PhysAgent: Automating Physics-Based 4D Synthesis via Trajectory-Grounded Multi-Agent Feedback

Changsheng Li; Chunji Lv; Jiaxi Ye; Rexar Lin; Yuchen Jiang

arxiv: 2606.08688 · v1 · pith:7Z5OBMNVnew · submitted 2026-06-07 · 💻 cs.RO · cs.CV


Pith reviewed 2026-06-27 18:22 UTC · model grok-4.3


keywords: physics-based simulation · 4D scene synthesis · multi-agent framework · force field optimization · trajectory feedback · generative modeling · physically plausible motion


The pith

A multi-agent framework converts vision-tracked trajectories into text so language models can dynamically switch force fields and automate physical 4D scene synthesis.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents PhysAgent as a way to remove manual expert tuning from physics simulation setup. It separates material properties from external forces and routes rendered motion into structured text descriptions. Language-model agents then use those descriptions to make broad adjustments to the force fields that drive the simulation. This process runs inside a simulator loop and produces scenes from multimodal prompts without getting trapped in the local solutions that affect earlier optimization techniques.

Core claim

PhysAgent decouples intrinsic materials from extrinsic dynamics, employs a Semantic Agent with an externalized Force Field Skill module to produce valid initializations, and then applies Refine Agents that extract dense point trajectories from rendered frames via vision foundation models, convert those trajectories into structured textual descriptors, and harness LLM commonsense reasoning to perform zero-shot macroscopic leaps that escape local optima while dynamically switching discrete force fields.
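The closed loop described in the core claim can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: `ForceField`, `run_loop`, and every `toy_*` function are hypothetical names, and the refine step is a hand-written rule standing in for the LLM-driven Refine Agents.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# All names here are hypothetical placeholders, not the paper's API.
Trajectory = List[Tuple[float, float]]  # one tracked point: (x, y) per frame


@dataclass
class ForceField:
    kind: str         # discrete choice, e.g. "wind" or "gravity"
    magnitude: float  # continuous parameter of that field


def run_loop(
    init_field: ForceField,
    simulate_and_track: Callable[[ForceField], List[Trajectory]],
    describe: Callable[[List[Trajectory]], str],
    refine: Callable[[str, ForceField], ForceField],
    accept: Callable[[str], bool],
    max_rounds: int = 5,
) -> Tuple[ForceField, int]:
    """Simulate, describe the motion as text, and let a refiner revise the field."""
    field = init_field
    for round_idx in range(max_rounds):
        tracks = simulate_and_track(field)  # simulator + renderer + point tracker
        report = describe(tracks)           # trajectories -> structured text
        if accept(report):
            return field, round_idx
        field = refine(report, field)       # stand-in for the Refine Agents
    return field, max_rounds


def toy_sim(field: ForceField) -> List[Trajectory]:
    # One point drifts right under "wind" and falls under any other field.
    if field.kind == "wind":
        dx, dy = field.magnitude, 0.0
    else:
        dx, dy = 0.0, -field.magnitude
    return [[(t * dx, t * dy) for t in range(4)]]


def toy_describe(tracks: List[Trajectory]) -> str:
    (x0, y0), (x1, y1) = tracks[0][0], tracks[0][-1]
    return f"dx={x1 - x0:.1f} dy={y1 - y0:.1f}"


def toy_refine(report: str, field: ForceField) -> ForceField:
    # Rule-based stand-in for LLM reasoning: switch the discrete field kind.
    return ForceField("wind", field.magnitude)


def toy_accept(report: str) -> bool:
    return report.startswith("dx=3.0")


final_field, rounds = run_loop(
    ForceField("gravity", 1.0), toy_sim, toy_describe, toy_refine, toy_accept
)
print(final_field, rounds)
```

Passing the simulator, descriptor, and refiner in as callables keeps the loop itself independent of any particular engine or language model, which is the separation the claim appears to rely on.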

What carries the argument

Trajectory-Grounded Multi-Agent Feedback, which turns dense point trajectories from vision models into textual descriptors that enable LLM reasoning to adjust discrete force fields inside the simulation loop.
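As a rough illustration of the trajectory-to-text step, the sketch below summarises one tracked 2D point as a line of text. The descriptor fields (`net_displacement`, `dominant_direction`, `mean_speed`), the y-up convention, and the function name are all assumptions made for this example; the paper's actual descriptor format is not given in the text reviewed here.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]


def describe_track(track: List[Point], fps: float = 1.0) -> str:
    """Summarise one 2D point track as text (hypothetical format, y axis up)."""
    (x0, y0), (x1, y1) = track[0], track[-1]
    dx, dy = x1 - x0, y1 - y0
    duration = (len(track) - 1) / fps
    # Path length is the sum of per-frame step lengths.
    path_len = sum(math.dist(a, b) for a, b in zip(track, track[1:]))
    mean_speed = path_len / duration if duration > 0 else 0.0
    if math.hypot(dx, dy) < 1e-6:
        direction = "static"
    elif abs(dx) >= abs(dy):
        direction = "right" if dx > 0 else "left"
    else:
        direction = "up" if dy > 0 else "down"
    return (
        f"net_displacement=({dx:.2f}, {dy:.2f}); "
        f"dominant_direction={direction}; mean_speed={mean_speed:.2f}"
    )


falling = [(0.0, 0.0), (0.0, -1.0), (0.0, -3.0)]
print(describe_track(falling))
# net_displacement=(0.00, -3.00); dominant_direction=down; mean_speed=1.50
```

A summary like this discards most per-frame detail, so whether it retains enough information for a force-field decision is exactly the open question the review raises.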

If this is right

Large-scale production of physically stable simulation data becomes possible from arbitrary multimodal inputs without per-scene expert configuration.
The separation of material and force optimization allows independent refinement of each component inside the same loop.
Zero-shot force-field switching removes the need for continuous gradient signals that SDS methods require.
Diversity and physical accuracy both increase relative to baselines that rely on material optimization alone.
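The point about gradient signals can be illustrated with a toy problem. The sketch below is not SDS and not the paper's setup: it assumes a made-up loss over a discrete field kind and a continuous magnitude, and shows that descending on the magnitude alone never changes the kind, whereas a single discrete switch reaches a much lower loss.

```python
from typing import Tuple

TARGET = (3.0, 0.0)  # toy goal: net displacement of 3 units to the right


def displacement(kind: str, magnitude: float) -> Tuple[float, float]:
    """Toy physics: wind pushes right, gravity pulls down."""
    if kind == "wind":
        return (magnitude, 0.0)
    if kind == "gravity":
        return (0.0, -magnitude)
    raise ValueError(kind)


def loss(kind: str, magnitude: float) -> float:
    dx, dy = displacement(kind, magnitude)
    return (dx - TARGET[0]) ** 2 + (dy - TARGET[1]) ** 2


def descend_magnitude(kind: str, magnitude: float, lr: float = 0.1,
                      steps: int = 200, eps: float = 1e-4) -> float:
    """Gradient descent on the magnitude only; the kind stays fixed throughout."""
    for _ in range(steps):
        grad = (loss(kind, magnitude + eps) - loss(kind, magnitude - eps)) / (2 * eps)
        magnitude -= lr * grad
    return magnitude


# Continuous optimisation inside the wrong field kind plateaus at a high loss.
stuck_loss = loss("gravity", descend_magnitude("gravity", 1.0))
# Switching the discrete kind first, then tuning, reaches the target.
switched_loss = loss("wind", descend_magnitude("wind", 1.0))
print(round(stuck_loss, 3), round(switched_loss, 3))
```

In this toy the best achievable loss under `"gravity"` is 9, because no downward magnitude produces rightward motion; only changing the field kind helps, and that change is not reachable by a gradient step on `magnitude`.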

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same trajectory-to-text conversion step could be reused to debug or correct simulations after initial generation rather than only during creation.
If the textual descriptors capture enough motion statistics, the approach might transfer to domains that use different underlying engines, such as rigid-body or fluid solvers.
Combining the feedback loop with real sensor data from physical robots could close the sim-to-real gap for policy training.

Load-bearing premise

Converting dense point trajectories into structured text gives language models enough information to make correct zero-shot decisions about which force fields to apply or switch.

What would settle it

A side-by-side test in which PhysAgent-generated scenes exhibit the same physical violations as, or lower accuracy scores than, SDS-optimized scenes on the same prompts.

Figures

Figures reproduced from arXiv: 2606.08688 by Changsheng Li, Chunji Lv, Jiaxi Ye, Rexar Lin, Yuchen Jiang.

**Figure 1.** Comparison of physics-based 4D synthesis paradigms. (a) SDS-based methods automate material optimization but suffer from high optimization cost, unstable generation and rely on manually crafted forces. (b) Naive generative approaches automate materials but lack physical feedback, leading to inaccuracies and neglecting environmental forces. (c) Our PhysAgent proposes a "simulator-in-the-loop" paradigm. By a…

**Figure 2.** PhysAgent Framework Overview. Operating in a closed-loop "simulator-in-the-loop" paradigm, the system processes multimodal inputs. First, intrinsic materials and Gaussian anchors are extracted from the reference image to initialize the 3DGS representation and object physical properties. Concurrently, the Semantic Agent interprets the text prompt and queries the Force Field Skill Library to generate the for…

**Figure 3.** Qualitative results. Continuous physical dynamic responses generated by PhysAgent under…

**Figure 4.** Qualitative comparisons. Comparison of dynamic responses generated by PhysAgent and…

**Figure 5.** Qualitative ablation of the Refine Agents. "Before" denotes the open-loop generation,…

Original abstract

Achieving fully automated, physically plausible 3D motion synthesis is a core objective in graphics and generative AI. However, configuring complex environmental force fields still relies entirely on manual expert intervention, creating a severe bottleneck for large-scale simulation data generation. Existing automated methods primarily focus on material optimization and exhibit severe modality gaps and technical flaws when applied to the vastly more complex force field optimization space: naive Large Language Models (LLMs) lack underlying simulation feedback, causing severe physical inaccuracies, while traditional Score Distillation Sampling (SDS) suffers from sluggish gradients, local optima entrapment, and a mathematical inability to dynamically switch discrete force fields. To address this, we propose PhysAgent, the first simulator-in-the-loop multi-agent framework that leverages multimodal inputs for automated, physically grounded 4D synthesis. By decoupling intrinsic materials from extrinsic dynamics, PhysAgent utilizes a Semantic Agent equipped with an externalized Force Field Skill module to master simulation rules and generate valid initializations. Subsequently, the Refine Agents, driven by Trajectory-Grounded Multi-Agent Feedback, leverage vision foundation models to extract dense point trajectories from rendered frames. By converting these explicit motion trajectories into structured textual descriptors, the agent harnesses LLM commonsense reasoning to execute zero-shot macroscopic leaps, effectively escaping local optima and dynamically switching discrete force fields. Extensive experiments demonstrate that PhysAgent rapidly generates stable, diverse physical scenes from arbitrary multimodal prompts, significantly outperforming existing baselines in both generation diversity and physical accuracy.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

PhysAgent's multi-agent setup for automating force field tweaks via trajectory text conversion is a reasonable framing of the bottleneck, but the conversion step and lack of visible results leave the main claims untested.


The paper's main contribution is a simulator-in-the-loop multi-agent system that splits material properties from force fields, uses a Semantic Agent for initial setups, and then Refine Agents that pull dense trajectories from renders, convert them to text, and let an LLM make discrete field switches. This directly targets the manual configuration problem in large-scale physics simulation data generation.

It does a clean job stating the limitations of prior LLM-only and SDS methods for this specific task. The decoupling of intrinsic and extrinsic elements is a sensible organizational move, and grounding the feedback in actual rendered trajectories is a logical step beyond pure language prompting.

The soft spot is the conversion of point trajectories into structured textual descriptors. The abstract describes the step but supplies no format, prompt template, or check that the text keeps the dynamic information needed for valid force-field decisions. Without that, it is unclear whether the LLM is truly performing reliable macroscopic leaps or simply reintroducing the modality gaps the paper criticizes elsewhere. The claim of extensive experiments showing better diversity and physical accuracy is stated but not supported by any numbers, baselines, or ablations in the text provided, so the soundness cannot be evaluated yet.
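One way the fidelity concern could be probed is a round-trip check: serialise a trajectory to text, parse it back, and measure how much position information was lost. The serialisation format and function names below are hypothetical, chosen only to make the idea concrete; they are not taken from the paper.

```python
import math
import re
from typing import List, Tuple

Point = Tuple[float, float]


def to_text(track: List[Point], decimals: int = 1) -> str:
    """Serialise a track as '(x, y) -> (x, y) -> ...' with fixed precision."""
    return " -> ".join(f"({x:.{decimals}f}, {y:.{decimals}f})" for x, y in track)


def from_text(text: str) -> List[Point]:
    """Parse the format produced by to_text back into points."""
    pairs = re.findall(r"\((-?\d+(?:\.\d+)?), (-?\d+(?:\.\d+)?)\)", text)
    return [(float(x), float(y)) for x, y in pairs]


def max_roundtrip_error(track: List[Point], decimals: int = 1) -> float:
    """Largest per-point distance between the original and the recovered track."""
    recovered = from_text(to_text(track, decimals))
    return max(math.dist(a, b) for a, b in zip(track, recovered))


track = [(0.0, 0.0), (0.26, -0.11), (0.49, -0.44)]
coarse_error = max_roundtrip_error(track, decimals=1)
fine_error = max_roundtrip_error(track, decimals=3)
print(to_text(track, decimals=1))
print(coarse_error > fine_error)
```

A check like this only bounds positional loss. It says nothing about whether a language model reading the text draws the right physical conclusion, which would need a separate ablation.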

The work is aimed at people building automated pipelines for physics-based 4D content in graphics and robotics simulation. A reader already working on LLM agents for simulation control would find the framework sketch useful as a starting point.

It deserves a serious referee because the underlying problem is concrete and the proposed structure is distinct enough from existing lines to merit external scrutiny, even if the current draft would need substantial additions on the text representation and the reported results.

Referee Report

2 major / 0 minor

Summary. The paper proposes PhysAgent, the first simulator-in-the-loop multi-agent framework for automated, physically grounded 4D synthesis from arbitrary multimodal prompts. It decouples intrinsic materials from extrinsic dynamics via a Semantic Agent with an externalized Force Field Skill module for valid initializations, followed by Refine Agents that extract dense point trajectories from rendered frames using vision foundation models, convert these to structured textual descriptors, and apply LLM commonsense reasoning for zero-shot macroscopic leaps that dynamically switch discrete force fields and escape SDS local optima. The abstract claims this yields stable, diverse physical scenes that significantly outperform baselines in diversity and physical accuracy.

Significance. If the core mechanism holds, the approach could remove the manual-expert bottleneck in configuring complex environmental force fields, enabling scalable automated generation of physically plausible 4D content for graphics and simulation data. The simulator-in-the-loop multi-agent design with trajectory-grounded feedback and explicit decoupling of materials/dynamics represents a potentially useful direction beyond pure SDS or naive LLM methods.

major comments (2)

[Abstract] The central claim that converting dense point trajectories into structured textual descriptors enables LLM commonsense reasoning to perform reliable zero-shot macroscopic leaps and physically valid discrete force-field switches is load-bearing, yet the manuscript supplies no format, prompting template, or mechanism ensuring the text representation preserves the necessary dynamics information rather than reintroducing modality gaps.
[Abstract] The claim of significantly outperforming baselines in generation diversity and physical accuracy cannot be assessed because the manuscript provides no experimental details, quantitative results, baseline comparisons, implementation specifics, or ablations on the trajectory-to-text step.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We will revise the manuscript to supply the missing implementation details on the trajectory-to-text conversion and to make the experimental claims fully assessable.


Referee: [Abstract] The central claim that converting dense point trajectories into structured textual descriptors enables LLM commonsense reasoning to perform reliable zero-shot macroscopic leaps and physically valid discrete force-field switches is load-bearing, yet the manuscript supplies no format, prompting template, or mechanism ensuring the text representation preserves the necessary dynamics information rather than reintroducing modality gaps.

Authors: We agree the abstract summarizes the conversion at a high level without the concrete format or template. In revision we will add an explicit subsection (or appendix) that defines the structured textual descriptor schema, provides the exact prompting template passed to the LLM, and explains the design choices (e.g., inclusion of velocity vectors, contact events, and force-field state) intended to retain dynamic information and avoid modality gaps.
Revision: yes

Referee: [Abstract] The claim of significantly outperforming baselines in generation diversity and physical accuracy cannot be assessed because the manuscript provides no experimental details, quantitative results, baseline comparisons, implementation specifics, or ablations on the trajectory-to-text step.

Authors: The current manuscript version presents the high-level claims in the abstract but does not embed the supporting quantitative tables, baseline numbers, or ablations. We will expand the experiments section (and, if space permits, the abstract) to include (i) the full set of diversity and physical-accuracy metrics, (ii) direct comparisons against the cited baselines, (iii) implementation hyperparameters, and (iv) a dedicated ablation isolating the trajectory-to-text component. These additions will make the performance claims directly verifiable.
Revision: yes

Circularity Check

0 steps flagged

No significant circularity; framework relies on external components


The paper presents PhysAgent as a multi-agent simulator-in-the-loop system that extracts trajectories via external vision foundation models, converts them to text, and uses LLM reasoning for force-field decisions. No load-bearing step reduces by construction to a fitted parameter, self-defined quantity, or self-citation chain within the paper. The derivation chain invokes external models and commonsense reasoning rather than internal self-reference. This matches the default non-circular case; the method's validity depends on the accuracy of those external modules, which is a separate empirical question.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 2 invented entities

Based solely on the abstract; limited visibility into full set of assumptions or parameters.

axioms (2)

domain assumption: Vision foundation models can accurately extract dense point trajectories from rendered frames.
Invoked to supply motion information to Refine Agents.
domain assumption: LLM commonsense reasoning applied to textual trajectory descriptors can execute zero-shot macroscopic leaps to adjust and switch discrete force fields.
Central mechanism claimed to escape local optima.

invented entities (2)

Semantic Agent equipped with externalized Force Field Skill module (no independent evidence)
purpose: Master simulation rules and generate valid initializations by decoupling materials from dynamics.
Introduced as core component of the framework.
Refine Agents driven by Trajectory-Grounded Multi-Agent Feedback (no independent evidence)
purpose: Refine synthesis by converting trajectories to text and using LLM reasoning for force field adjustments.
Introduced as core component of the framework.



Reference graph

Works this paper leans on

66 extracted references · 24 canonical work pages · 12 internal anchors

[1]

Physgaussian: Physics-integrated 3d gaussians for generative dynamics

Tianyi Xie, Zeshun Zong, Yuxing Qiu, Xuan Li, Yutao Feng, Yin Yang, and Chenfanfu Jiang. Physgaussian: Physics-integrated 3d gaussians for generative dynamics. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4389–4398, 2024

2024
[2]

Physgm: Large physical gaussian model for feed-forward 4d synthesis.arXiv preprint arXiv:2508.13911, 2025

Chunji Lv, Zequn Chen, Donglin Di, Weinan Zhang, Hao Li, Wei Chen, Yinjie Lei, and Changsheng Li. Physgm: Large physical gaussian model for feed-forward 4d synthesis.arXiv preprint arXiv:2508.13911, 2025

work page arXiv 2025
[3]

Dreamphysics: Learning physics-based 3d dynamics with video diffusion priors

Tianyu Huang, Haoze Zhang, Yihan Zeng, Zhilu Zhang, Hui Li, Wangmeng Zuo, and Ryn- son WH Lau. Dreamphysics: Learning physics-based 3d dynamics with video diffusion priors. InProceedings of the AAAI Conference on Artificial Intelligence, volume 39, pages 3733–3741, 2025

2025
[4]

arXiv preprint arXiv:2501.18982 , year=

Yuchen Lin, Chenguo Lin, Jianjin Xu, and Yadong Mu. Omniphysgs: 3d constitutive gaussians for general physics-based dynamics generation.arXiv preprint arXiv:2501.18982, 2025

work page arXiv 2025
[5]

Physdreamer: Physics-based interaction with 3d objects via video generation

Tianyuan Zhang, Hong-Xing Yu, Rundi Wu, Brandon Y Feng, Changxi Zheng, Noah Snavely, Jiajun Wu, and William T Freeman. Physdreamer: Physics-based interaction with 3d objects via video generation. InEuropean Conference on Computer Vision, pages 388–406. Springer, 2024

2024
[6]

$\pi_0$: A Vision-Language-Action Flow Model for General Robot Control

Kevin Black, Noah Brown, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, Lachy Groom, Karol Hausman, Brian Ichter, et al. π0: A Vision-Language-Action Flow Model for General Robot Control.arXiv preprint arXiv:2410.24164, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[7]

OpenVLA: An Open-Source Vision-Language-Action Model

Moo Jin Kim, Karl Pertsch, Siddharth Karamcheti, Ted Xiao, Ashwin Balakrishna, Suraj Nair, Rafael Rafailov, Ethan Foster, Grace Lam, Pannag Sanketi, et al. Openvla: An open-source vision-language-action model.arXiv preprint arXiv:2406.09246, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[8]

DreamFusion: Text-to-3D using 2D Diffusion

Ben Poole, Ajay Jain, Jonathan T Barron, and Ben Mildenhall. Dreamfusion: Text-to-3d using 2d diffusion.arXiv preprint arXiv:2209.14988, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[9]

Physsplat: Efficient physics simulation for 3d scenes via mllm-guided gaussian splatting

Haoyu Zhao, Hao Wang, Xingyue Zhao, Hao Fei, Hongqiu Wang, Chengjiang Long, and Hua Zou. Physsplat: Efficient physics simulation for 3d scenes via mllm-guided gaussian splatting. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 5242–5252, 2025

2025
[10]

The material point method for simulating continuum materials

Chenfanfu Jiang, Craig Schroeder, Joseph Teran, Alexey Stomakhin, and Andrew Selle. The material point method for simulating continuum materials. InAcm siggraph 2016 courses, pages 1–52. 2016

2016
[11]

A material point method for snow simulation.ACM Transactions on Graphics (TOG), 32(4):1–10, 2013

Alexey Stomakhin, Craig Schroeder, Lawrence Chai, Joseph Teran, and Andrew Selle. A material point method for snow simulation.ACM Transactions on Graphics (TOG), 32(4):1–10, 2013

2013
[12]

Segment anything

Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C Berg, Wan-Yen Lo, et al. Segment anything. In Proceedings of the IEEE/CVF international conference on computer vision, pages 4015–4026, 2023

2023
[13]

SAM 2: Segment Anything in Images and Videos

Nikhila Ravi, Valentin Gabeur, Yuan-Ting Hu, Ronghang Hu, Chaitanya Ryali, Tengyu Ma, Haitham Khedr, Roman Rädle, Chloe Rolland, Laura Gustafson, et al. Sam 2: Segment anything in images and videos.arXiv preprint arXiv:2408.00714, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[14]

Depth anything: Unleashing the power of large-scale unlabeled data

Lihe Yang, Bingyi Kang, Zilong Huang, Xiaogang Xu, Jiashi Feng, and Hengshuang Zhao. Depth anything: Unleashing the power of large-scale unlabeled data. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10371–10381, 2024

2024
[15]

Depth Anything 3: Recovering the Visual Space from Any Views

Haotong Lin, Sili Chen, Jun Hao Liew, Donny Y . Chen, Zhenyu Li, Guang Shi, Jiashi Feng, and Bingyi Kang. Depth anything 3: Recovering the visual space from any views.arXiv preprint arXiv:2511.10647, 2025. 10

work page internal anchor Pith review Pith/arXiv arXiv 2025
[16]

Cotracker3: Simpler and better point tracking by pseudo-labelling real videos

Nikita Karaev, Yuri Makarov, Jianyuan Wang, Natalia Neverova, Andrea Vedaldi, and Christian Rupprecht. Cotracker3: Simpler and better point tracking by pseudo-labelling real videos. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 6013–6022, 2025

2025
[17]

Cotracker: It is better to track together

Nikita Karaev, Ignacio Rocco, Benjamin Graham, Natalia Neverova, Andrea Vedaldi, and Christian Rupprecht. Cotracker: It is better to track together. InEuropean conference on computer vision, pages 18–35. Springer, 2024

2024
[18]

3d gaussian splatting for real-time radiance field rendering.ACM Trans

Bernhard Kerbl, Georgios Kopanas, Thomas Leimkühler, George Drettakis, et al. 3d gaussian splatting for real-time radiance field rendering.ACM Trans. Graph., 42(4):139–1, 2023

2023
[19]

Shap-E: Generating Conditional 3D Implicit Functions

Heewoo Jun and Alex Nichol. Shap-e: Generating conditional 3d implicit functions.arXiv preprint arXiv:2305.02463, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[20]

Point-E: A System for Generating 3D Point Clouds from Complex Prompts

Alex Nichol, Heewoo Jun, Prafulla Dhariwal, Pamela Mishkin, and Mark Chen. Point-e: A system for generating 3d point clouds from complex prompts.arXiv preprint arXiv:2212.08751, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[21]

In Computer Graphics Forum, volume 36, pages 1–12

The shape variational autoencoder: A deep generative model of part-segmented 3d objects. In Computer Graphics Forum, volume 36, pages 1–12. Wiley Online Library, 2017

2017
[22]

Single-stage diffusion nerf: A unified approach to 3d generation and reconstruction

Hansheng Chen, Jiatao Gu, Anpei Chen, Wei Tian, Zhuowen Tu, Lingjie Liu, and Hao Su. Single-stage diffusion nerf: A unified approach to 3d generation and reconstruction. InPro- ceedings of the IEEE/CVF international conference on computer vision, pages 2416–2425, 2023

2023
[23]

Lgm: Large multi-view gaussian model for high-resolution 3d content creation

Jiaxiang Tang, Zhaoxi Chen, Xiaokang Chen, Tengfei Wang, Gang Zeng, and Ziwei Liu. Lgm: Large multi-view gaussian model for high-resolution 3d content creation. InEuropean Conference on Computer Vision, pages 1–18. Springer, 2024

2024
[24]

LRM: Large Reconstruction Model for Single Image to 3D

Yicong Hong, Kai Zhang, Jiuxiang Gu, Sai Bi, Yang Zhou, Difan Liu, Feng Liu, Kalyan Sunkavalli, Trung Bui, and Hao Tan. Lrm: Large reconstruction model for single image to 3d. arXiv preprint arXiv:2311.04400, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[25]

Long-lrm: Long-sequence large reconstruction model for wide-coverage gaussian splats

Chen Ziwen, Hao Tan, Kai Zhang, Sai Bi, Fujun Luan, Yicong Hong, Li Fuxin, and Zexiang Xu. Long-lrm: Long-sequence large reconstruction model for wide-coverage gaussian splats. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 4349–4359, 2025

2025
[26]

Gs-lrm: Large reconstruction model for 3d gaussian splatting

Kai Zhang, Sai Bi, Hao Tan, Yuanbo Xiangli, Nanxuan Zhao, Kalyan Sunkavalli, and Zexiang Xu. Gs-lrm: Large reconstruction model for 3d gaussian splatting. InEuropean Conference on Computer Vision, pages 1–19. Springer, 2024

2024
[27]

Motiongs: Exploring explicit motion guidance for deformable 3d gaussian splatting.Advances in Neural Information Processing Systems, 37:101790–101817, 2024

Ruijie Zhu, Yanzhe Liang, Hanzhi Chang, Jiacheng Deng, Jiahao Lu, Wenfei Yang, Tianzhu Zhang, and Yongdong Zhang. Motiongs: Exploring explicit motion guidance for deformable 3d gaussian splatting.Advances in Neural Information Processing Systems, 37:101790–101817, 2024

2024
[28]

Align your gaussians: Text-to-4d with dynamic 3d gaussians and composed diffusion models

Huan Ling, Seung Wook Kim, Antonio Torralba, Sanja Fidler, and Karsten Kreis. Align your gaussians: Text-to-4d with dynamic 3d gaussians and composed diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 8576–8588, 2024

2024
[29]

arXiv preprint arXiv:2312.17142 , year=

Jiawei Ren, Liang Pan, Jiaxiang Tang, Chi Zhang, Ang Cao, Gang Zeng, and Ziwei Liu. Dreamgaussian4d: Generative 4d gaussian splatting.arXiv preprint arXiv:2312.17142, 2023

work page arXiv 2023
[30]

Improved distribution matching distillation for fast image synthesis

Tianwei Yin, Michaël Gharbi, Taesung Park, Richard Zhang, Eli Shechtman, Fredo Durand, and William T Freeman. Improved distribution matching distillation for fast image synthesis. Advances in neural information processing systems, 37:47455–47487, 2024

2024
[31]

4dgen: Grounded 4d content generation with spatial-temporal consistency.arXiv preprint arXiv:2312.17225, 2023

Yuyang Yin, Dejia Xu, Zhangyang Wang, Yao Zhao, and Yunchao Wei. 4dgen: Grounded 4d content generation with spatial-temporal consistency.arXiv preprint arXiv:2312.17225, 2023. 11

work page arXiv 2023
[32]

4diffusion: Multi-view video diffusion model for 4d generation.Advances in Neural Information Processing Systems, 37:15272–15295, 2024

Haiyu Zhang, Xinyuan Chen, Yaohui Wang, Xihui Liu, Yunhong Wang, and Yu Qiao. 4diffusion: Multi-view video diffusion model for 4d generation.Advances in Neural Information Processing Systems, 37:15272–15295, 2024

2024
[33]

Animate3d: Animating any 3d model with multi-view video diffusion.Advances in Neural Information Processing Systems, 37:125879–125906, 2024

Yanqin Jiang, Chaohui Yu, Chenjie Cao, Fan Wang, Weiming Hu, and Jin Gao. Animate3d: Animating any 3d model with multi-view video diffusion.Advances in Neural Information Processing Systems, 37:125879–125906, 2024

2024
[34]

Efficient4d: Fast dynamic 3d object generation from a single-view video.International Journal of Computer Vision, 134(1):14, 2026

Zijie Pan, Zeyu Yang, Xiatian Zhu, and Li Zhang. Efficient4d: Fast dynamic 3d object generation from a single-view video.International Journal of Computer Vision, 134(1):14, 2026

2026
[35]

Physctrl: Generative physics for controllable and physics-grounded video generation.arXiv preprint arXiv:2509.20358, 2025

Chen Wang, Chuhao Chen, Yiming Huang, Zhiyang Dou, Yuan Liu, Jiatao Gu, and Lingjie Liu. Physctrl: Generative physics for controllable and physics-grounded video generation.arXiv preprint arXiv:2509.20358, 2025

work page arXiv 2025
[36]

Lome: Learning human-object manipulation with action-conditioned egocentric world model.arXiv preprint arXiv:2603.27449, 2026

Quankai Gao, Jiawei Yang, Qiangeng Xu, Le Chen, and Yue Wang. Lome: Learning human-object manipulation with action-conditioned egocentric world model.arXiv preprint arXiv:2603.27449, 2026

work page arXiv 2026
[37]

Force prompting: Video generation models can learn and generalize physics-based control signals.arXiv preprint arXiv:2505.19386, 2025

Nate Gillman, Charles Herrmann, Michael Freeman, Daksh Aggarwal, Evan Luo, Deqing Sun, and Chen Sun. Force prompting: Video generation models can learn and generalize physics-based control signals.arXiv preprint arXiv:2505.19386, 2025

work page arXiv 2025
[38]

Warp: A high-performance python framework for gpu simulation and graphics

Miles Macklin. Warp: A high-performance python framework for gpu simulation and graphics. InNVIDIA GPU Technology Conference (GTC), volume 3, 2022

2022
[39]

i- physgaussian: Implicit physical simulation for 3d gaussian splatting.arXiv preprint arXiv:2602.17117, 2026

Yicheng Cao, Zhuo Huang, Yu Yao, Yiming Ying, Daoyi Dong, and Tongliang Liu. i- physgaussian: Implicit physical simulation for 3d gaussian splatting.arXiv preprint arXiv:2602.17117, 2026

work page arXiv 2026
[40]

arXiv preprint arXiv:2411.17189 , year=

Xiyang Tan, Ying Jiang, Xuan Li, Zeshun Zong, Tianyi Xie, Yin Yang, and Chenfanfu Jiang. Physmotion: Physics-grounded dynamics from a single image.arXiv preprint arXiv:2411.17189, 2024

work page arXiv 2024
[41]

arXiv preprint arXiv:2406.04338 , year=

Fangfu Liu, Hanyang Wang, Shunyu Yao, Shengjun Zhang, Jie Zhou, and Yueqi Duan. Physics3d: Learning physical properties of 3d gaussians via video diffusion.arXiv preprint arXiv:2406.04338, 2024

work page arXiv 2024
[42]

Motionphysics: Learnable motion distillation for text-guided simulation

Miaowei Wang, Jakub Zadro ˙zny, Oisin Mac Aodha, and Amir Vaxman. Motionphysics: Learnable motion distillation for text-guided simulation. InProceedings of the AAAI Conference on Artificial Intelligence, volume 40, pages 9993–10001, 2026

2026
[43]

Physgen3d: Crafting a miniature interactive world from a single image

Boyuan Chen, Hanxiao Jiang, Shaowei Liu, Saurabh Gupta, Yunzhu Li, Hao Zhao, and Shenlong Wang. Physgen3d: Crafting a miniature interactive world from a single image. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6178–6189, 2025

2025
[44]

Physgen: Rigid-body physics-grounded image-to-video generation

Shaowei Liu, Zhongzheng Ren, Saurabh Gupta, and Shenlong Wang. Physgen: Rigid-body physics-grounded image-to-video generation. InEuropean Conference on Computer Vision, pages 360–378. Springer, 2024

2024
[45]

Mosiv: Multi-object system identification from videos.arXiv preprint arXiv:2603.06022, 2026

Chunjiang Liu, Xiaoyuan Wang, Qingran Lin, Albert Xiao, Haoyu Chen, Shizheng Wen, Hao Zhang, Lu Qi, Ming-Hsuan Yang, Laszlo A Jeni, et al. Mosiv: Multi-object system identification from videos.arXiv preprint arXiv:2603.06022, 2026

work page arXiv 2026
[46]

Fastphysgs: Accelerating physics-based dynamic 3dgs simulation via interior completion and adaptive optimization.arXiv preprint arXiv:2602.01723, 2026

Yikun Ma, Yiqing Li, Jingwen Ye, Zhongkai Wu, Weidong Zhang, Lin Gao, and Zhi Jin. Fastphysgs: Accelerating physics-based dynamic 3dgs simulation via interior completion and adaptive optimization.arXiv preprint arXiv:2602.01723, 2026

work page arXiv 2026
[47]

PhysChoreo: Physics-Controllable Video Generation with Part-Aware Semantic Grounding

Haoze Zhang, Tianyu Huang, Zichen Wan, Xiaowei Jin, Hongzhi Zhang, Hui Li, and Wangmeng Zuo. Physchoreo: Physics-controllable video generation with part-aware semantic grounding. arXiv preprint arXiv:2511.20562, 2025. 12

work page internal anchor Pith review arXiv 2025
[48]

Voyager: An Open-Ended Embodied Agent with Large Language Models

Guanzhi Wang, Yuqi Xie, Yunfan Jiang, Ajay Mandlekar, Chaowei Xiao, Yuke Zhu, Linxi Fan, and Anima Anandkumar. V oyager: An open-ended embodied agent with large language models. arXiv preprint arXiv:2305.16291, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[49]

ProgPrompt: Generating Situated Robot Task Plans using Large Language Models

Ishika Singh, Valts Blukis, Arsalan Mousavian, Ankit Goyal, Danfei Xu, Jonathan Tremblay, Dieter Fox, Jesse Thomason, and Animesh Garg. Progprompt: Generating situated robot task plans using large language models.arXiv preprint arXiv:2209.11302, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[50]

Timo Schick, Jane Dwivedi-Yu, Roberto Dessì, Roberta Raileanu, Maria Lomeli, Eric Hambro, Luke Zettlemoyer, Nicola Cancedda, and Thomas Scialom. Toolformer: Language models can teach themselves to use tools. Advances in neural information processing systems, 36:68539–68551, 2023.

[51]

Yecheng Jason Ma, William Liang, Guanzhi Wang, De-An Huang, Osbert Bastani, Dinesh Jayaraman, Yuke Zhu, Linxi Fan, and Anima Anandkumar. Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv:2310.12931, 2023.

[52]

Jieming Cui, Tengyu Liu, Ziyu Meng, Jiale Yu, Ran Song, Wei Zhang, Yixin Zhu, and Siyuan Huang. Grove: A generalized reward for learning open-vocabulary physical skill. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 15781–15790, 2025.

[53]

Weixi Feng, Wanrong Zhu, Tsu-jui Fu, Varun Jampani, Arjun Akula, Xuehai He, Sugato Basu, Xin Eric Wang, and William Yang Wang. Layoutgpt: Compositional visual planning and generation with large language models. Advances in Neural Information Processing Systems, 36:18225–18250, 2023.

[54]

Yue Yang, Fan-Yun Sun, Luca Weihs, Eli VanderBilt, Alvaro Herrasti, Winson Han, Jiajun Wu, Nick Haber, Ranjay Krishna, Lingjie Liu, et al. Holodeck: Language guided generation of 3d embodied ai environments. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16227–16237, 2024.

[55]

Qiyao Xue, Xiangyu Yin, Boyuan Yang, and Wei Gao. Phyt2v: Llm-guided iterative self-refinement for physics-grounded text-to-video generation. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 18826–18836, 2025.

[56]

Qwen Team. Qwen3.5: Towards native multimodal agents, February 2026.

[57]

Matt Deitke, Dustin Schwenk, Jordi Salvador, Luca Weihs, Oscar Michel, Eli VanderBilt, Ludwig Schmidt, Kiana Ehsani, Aniruddha Kembhavi, and Ali Farhadi. Objaverse: A universe of annotated 3d objects. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 13142–13153, 2023.

[58]

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PMLR, 2021.

Appendix A Semantic Agent and Force Field Skill Lib...

Output Schema Definition

Your final configuration must be formatted as a JSON block matching the following schema. Multiple actions can be overlaid or sequenced within the actions array.

{ "default_drop": boolean, "actions": [ { "action_type": "translation" | "scale" | "impulse" | "torque", "vector": [x, y, z], "magnitude": float, "active_time": [start_time...

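A minimal sketch of how a caller might check an agent's output against the fields visible in the schema excerpt. The helper name `validate_config` is hypothetical, and because the excerpt is truncated at `active_time`, only that field's container type is checked; the numeric values in the usage example are illustrative, not taken from the paper.

```python
import json

# Action types listed in the schema excerpt.
ALLOWED_ACTION_TYPES = {"translation", "scale", "impulse", "torque"}


def _is_number(value):
    # bool is a subclass of int in Python, so exclude it explicitly.
    return isinstance(value, (int, float)) and not isinstance(value, bool)


def validate_config(text):
    """Parse a JSON string and check the fields shown in the schema excerpt.

    Hypothetical helper: returns the parsed dict or raises ValueError.
    """
    try:
        config = json.loads(text)
    except json.JSONDecodeError as exc:
        raise ValueError(f"not valid JSON: {exc}") from exc

    if not isinstance(config, dict):
        raise ValueError("top-level value must be an object")
    if not isinstance(config.get("default_drop"), bool):
        raise ValueError("default_drop must be a boolean")
    actions = config.get("actions")
    if not isinstance(actions, list):
        raise ValueError("actions must be an array")

    for index, action in enumerate(actions):
        if not isinstance(action, dict):
            raise ValueError(f"actions[{index}] must be an object")
        kind = action.get("action_type")
        if not isinstance(kind, str) or kind not in ALLOWED_ACTION_TYPES:
            raise ValueError(f"actions[{index}].action_type is not allowed")
        vector = action.get("vector")
        if not (isinstance(vector, list) and len(vector) == 3
                and all(_is_number(c) for c in vector)):
            raise ValueError(f"actions[{index}].vector must be [x, y, z]")
        if not _is_number(action.get("magnitude")):
            raise ValueError(f"actions[{index}].magnitude must be a number")
        # The schema is truncated at active_time, so only the container
        # type is checked here (assumption: it is a JSON array).
        if not isinstance(action.get("active_time"), list):
            raise ValueError(f"actions[{index}].active_time must be an array")
    return config


# Illustrative input; the active_time values are made up for the example.
EXAMPLE = (
    '{"default_drop": false, "actions": [{"action_type": "translation", '
    '"vector": [1, 0, 0], "magnitude": 1.0, "active_time": [0.0, 1.0]}]}'
)
```

Calling `validate_config(EXAMPLE)` returns the parsed dictionary, while a string with an unknown `action_type` raises `ValueError` before anything reaches the simulator.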
Physical Field Mapping & MPM Integration

• translation: For pushing, pulling, or blowing. Mapped to a continuous external force field (f_ext) accumulated over the grid update phase.
• scale: For squeezing or stretching. Mapped to a spatially-varying force field scaling along the normal axis. magnitude > 0 denotes outward stretching; < 0 denotes inward compressio...

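To illustrate what accumulating an external force during the grid update can look like, here is a minimal sketch of an explicit velocity update on MPM grid nodes. It assumes f_ext is a uniform force vector applied to every node that carries mass; the function name and that interpretation are assumptions for illustration, not the paper's implementation.

```python
def apply_external_force(node_velocities, node_masses, f_ext, dt):
    """Explicit update v_i <- v_i + dt * f_ext / m_i on nodes with mass.

    Assumption: f_ext is a single 3D force vector shared by all nodes.
    Nodes with zero mass are left untouched, as is usual in MPM grid updates.
    """
    updated = []
    for velocity, mass in zip(node_velocities, node_masses):
        if mass <= 0.0:
            updated.append(tuple(velocity))  # empty node, nothing to move
            continue
        updated.append(
            tuple(v + dt * f / mass for v, f in zip(velocity, f_ext))
        )
    return updated
```

Calling the function once per sub-step while the action is active accumulates the push over time, which matches the "continuous" wording in the translation bullet.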
Parameter Constraints

• Coordinate System: The vector must be strictly resolved and normalized into a 3D unit vector.
• Magnitude (M): The auxiliary value of M ranges from 0.4 to 1.6. Default is 1.0. Modulate monotonically based on linguistic intensity modifiers (e.g., “slightly” → lower bound, “violently” → upper bound). Reverse sign for opposing semantics (...

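The constraints above can be sketched as two small helpers. The bounds 0.4, 1.0, and 1.6 and the two modifiers come from the text; the helper names, the lookup table, and the `opposing` flag are assumptions made for illustration, since the sign-reversal rule is truncated in the source.

```python
import math

# Bounds and default stated in the parameter constraints.
M_MIN, M_DEFAULT, M_MAX = 0.4, 1.0, 1.6

# Only these two modifiers appear in the text; a fuller table is an assumption.
INTENSITY = {"slightly": M_MIN, "violently": M_MAX}


def normalize(vector):
    """Return the 3D unit vector pointing along `vector`."""
    norm = math.sqrt(sum(c * c for c in vector))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [c / norm for c in vector]


def magnitude_for(modifier=None, opposing=False):
    """Map an intensity modifier to M, clamped to [M_MIN, M_MAX].

    Unknown or missing modifiers fall back to the default of 1.0.
    `opposing=True` flips the sign (assumed reading of the truncated rule).
    """
    value = INTENSITY.get(modifier, M_DEFAULT)
    value = min(max(value, M_MIN), M_MAX)
    return -value if opposing else value
```

For example, `normalize([3, 0, 4])` yields a unit vector along the same direction, and `magnitude_for("violently", opposing=True)` gives the upper bound with a reversed sign.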
Execution Constraint

You may reason step-by-step to analyze coordinates and intensities. However, the final mechanical parameters MUST be enclosed within a valid JSON block.

A.2 Force Field Skill Library Formulations

The Force Field Skill Library serves as the deterministic compiler that bridges the semantic JSON outputs and the continuous MPM simulator....

Geometric Anchoring (3DGS to Particles): The explicit 3D Gaussian Splatting (3DGS) representation is converted into Lagrangian particles. The Gaussian means (µ) are directly mapped to initial particle positions (x_p). The particle volumes (V_p) and densities are derived from the Gaussian scales (s) and opacities (α). These initialized arrays are cached...

Intrinsic Materials: The material properties extracted via PhysGM are mapped to the constitutive models defined in our simulator. This includes the material type index (e.g., 0 for jelly, 1 for metal, 2 for sand) and its corresponding Young’s modulus (E) and Poisson’s ratio (ν).

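Constitutive models in MPM codes are commonly parameterized by the Lamé parameters rather than by E and ν directly. The sketch below shows the standard linear-elasticity conversion; whether the simulator here performs exactly this step is an assumption, and the function name is hypothetical. The material index table repeats the examples given in the text.

```python
# Material type indices given as examples in the text.
MATERIAL_TYPES = {0: "jelly", 1: "metal", 2: "sand"}


def lame_parameters(youngs_modulus, poisson_ratio):
    """Convert (E, nu) to the Lame parameters (mu, lambda).

    mu     = E / (2 * (1 + nu))
    lambda = E * nu / ((1 + nu) * (1 - 2 * nu))
    """
    if youngs_modulus <= 0.0:
        raise ValueError("Young's modulus must be positive")
    if not -1.0 < poisson_ratio < 0.5:
        raise ValueError("Poisson's ratio must lie in (-1, 0.5)")
    mu = youngs_modulus / (2.0 * (1.0 + poisson_ratio))
    lam = (youngs_modulus * poisson_ratio
           / ((1.0 + poisson_ratio) * (1.0 - 2.0 * poisson_ratio)))
    return mu, lam
```

The range check on ν matters in practice: as ν approaches 0.5 the material becomes incompressible and λ diverges, which is a frequent source of unstable simulations.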
Extrinsic Force Fields: The compiled skills (translation, scale, impulse, torque) and their temporal constraints (active_time) generated by the Semantic Agent are directly injected

Global Hyperparameters & Boundaries: Default simulation attributes are appended to ensure numerical stability. This includes the background grid resolution (n_grid = 100), grid limits, sub-step size (Δt_sub), base gravitational acceleration, and default boundary conditions (e.g., a frictional ground collision plane).

B Refine Agents Workflow

Unlike traditi...
