
arxiv: 2605.18396 · v2 · pith:WFOLMCNXnew · submitted 2026-05-18 · 💻 cs.CV

NEWTON: Agentic Planning for Physically Grounded Video Generation

Pith reviewed 2026-05-20 10:52 UTC · model grok-4.3

classification 💻 cs.CV
keywords video generation · physical commonsense · agentic planning · physics conditioning · VideoPhy-2 · keyframe generation · prompt refinement · verifier-driven re-planning

The pith

An agentic planner improves physical accuracy in video generation by sequencing physics tools and verifying results instead of depending on text prompts alone.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Video generation models produce visually compelling footage yet routinely break basic physical rules: even the best model reaches only 32.6 percent joint accuracy on the VideoPhy-2 benchmark. The paper traces the problem to a specification bottleneck: ordinary text prompts omit the parameters that govern real-world dynamics, so scaling models cannot recover what was never supplied. NEWTON treats video generation as one tool among several in an agent's repertoire. A learned planner chooses and orders actions such as keyframe creation, scientific computation, and prompt refinement, while a verifier checks outcomes and triggers re-planning, so that the resulting conditioning can satisfy sufficiency, dynamism, and verifiability. The planner is trained on-policy inside the live loop, and the system raises joint accuracy on two different generators (LTX-Video and Veo-3.1) without any change to the generators themselves.

Core claim

We present NEWTON, in which video generation is demoted from the system output to one action inside an agent's toolbox: a learned planner orchestrates physics-aware tools (keyframe generation, scientific computation, prompt refinement) to construct rich conditioning, and a verifier closes the loop for iterative re-planning. The planner is the sole trainable component, optimized on-policy via Flow-GRPO inside the live multi-turn loop.

What carries the argument

The NEWTON learned planner that selects and sequences physics-aware tools inside a verifier-driven re-planning loop to produce conditioning that meets sufficiency, dynamism, and verifiability.
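
The loop described here (and in the Figure 4 caption: planner reads memory, executor dispatches tools alongside the frozen generator, verifier scores on SA and PC, best-scored video across T cycles is returned) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: every function, class, and tool identifier is hypothetical, the planner and verifier are random stand-ins, and ranking attempts by min(SA, PC) is an assumption the source does not confirm.

```python
import random
from dataclasses import dataclass, field

# Tool names taken from the paper's description; the identifiers are hypothetical.
TOOLS = ["keyframe_generation", "scientific_computation", "prompt_refinement"]

@dataclass
class Attempt:
    tools: list
    video: str
    sa: float  # semantic adherence score from the verifier
    pc: float  # physical commonsense score from the verifier

@dataclass
class Memory:
    query: str
    attempts: list = field(default_factory=list)

def plan(memory, rng):
    """Stand-in planner: returns an ordered subset of tools.
    A learned planner would condition on the memory; this one is random."""
    k = rng.randint(1, len(TOOLS))
    return rng.sample(TOOLS, k)

def execute(query, tools):
    """Stand-in executor: pretends to run the tools and the frozen generator."""
    conditioning = " + ".join(tools)
    return f"video({query} | {conditioning})"

def verify(video, rng):
    """Stand-in verifier: returns random (SA, PC) scores in [1, 5]."""
    return rng.uniform(1, 5), rng.uniform(1, 5)

def run(query, max_cycles=3, seed=0):
    rng = random.Random(seed)
    memory = Memory(query)
    for _ in range(max_cycles):
        tools = plan(memory, rng)
        video = execute(query, tools)
        sa, pc = verify(video, rng)
        # Verifier feedback is appended to memory for the next cycle.
        memory.attempts.append(Attempt(tools, video, sa, pc))
    # Assumption: attempts are ranked by their weaker score, min(SA, PC).
    best = max(memory.attempts, key=lambda a: min(a.sa, a.pc))
    return best, memory

best, memory = run("a bottle of beer is poured into a mug until it is full")
print(best.tools, round(best.sa, 2), round(best.pc, 2))
```

Swapping `plan` for a trained policy is the only change the paper's design would require in this skeleton, since the generator and verifier stay frozen.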

If this is right

  • Joint accuracy on VideoPhy-2 rises from 21.4 percent to 29.7 percent for LTX-Video and from 30.7 percent to 37.4 percent for Veo-3.1.
  • The same gains appear without any modification to the underlying video generators.
  • The three required properties for physics conditioning—sufficiency, dynamism, and verifiability—are satisfied through tool orchestration rather than prompt engineering alone.
  • The planner becomes the only component that needs training, keeping the rest of the system fixed.
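
To put the reported numbers in perspective, the short calculation below derives the absolute and relative gains from the four accuracies quoted above. Only the accuracies come from the paper; the helper name `gains` is illustrative.

```python
# Joint accuracy on VideoPhy-2 (percent), as reported: (baseline, with NEWTON).
results = {
    "LTX-Video": (21.4, 29.7),
    "Veo-3.1": (30.7, 37.4),
}

def gains(baseline, improved):
    """Return (absolute gain in points, relative gain in percent)."""
    absolute = round(improved - baseline, 1)
    relative = round(100 * (improved - baseline) / baseline, 1)
    return absolute, relative

for name, (baseline, improved) in results.items():
    absolute, relative = gains(baseline, improved)
    print(f"{name}: +{absolute} points absolute, +{relative}% relative")
# LTX-Video: +8.3 points absolute, +38.8% relative
# Veo-3.1: +6.7 points absolute, +21.8% relative
```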

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same planner-plus-verifier structure could be tested on other generative tasks that require external physical constraints, such as 3D scene synthesis or action sequence planning.
  • The approach suggests that separating specification and verification from the core generator may be more scalable than attempting to embed all physics inside a single model.
  • On-policy optimization inside the live loop may transfer to other agentic systems where the cost of poor tool choices is high.

Load-bearing premise

The planner can reliably choose and order the correct tools so that verifier feedback produces measurable gains in physical accuracy.

What would settle it

Replace the learned planner with random tool selection and measure whether the reported accuracy gains on VideoPhy-2 disappear for LTX-Video and Veo-3.1.

Figures

Figures reproduced from arXiv: 2605.18396 by Baigui Sun, Chao Xu, Huihan Wang, Juncheng Wang, Shujun Wang, Wenlong Hou, Yang Liu, Yijie Qian, Yong Liu, Yuxiang Feng.

Figure 1: Three paradigms for physically grounded video gener
Figure 2: Left: a bottle of beer is poured into a mug until it is full; our method (top) renders progressive filling with foam buildup, while
Figure 3: Text prompts are a lossy compression of physics: the
Figure 4: NEWTON overview. Left: the iterative pipeline. A user query and toolkit set initialize Cycle-1; at each cycle the Planner (trainable) reads the memory pool, selects tools, and the Executor dispatches them alongside the frozen video generator. The Verifier (frozen) scores the result on SA and PC, appending feedback to memory for the next cycle. The best-scored video across T cycles is returned. Right: Flow-…
Figure 5: Qualitative comparison on real-world samples.
Figure 6: Qualitative comparison on animation samples.
Figure 7: Results of human preference study. no visible pile, Hunyuan stops the stream mid-pour, and Wan2.2 sprinkles without accumulation. Right (grapefruit peeling): NEWTON renders the rind progressively separating from the flesh, while baselines either start pre-cut, perform an abrupt cut without peeling, or produce only a tiny slice
Figure 8: Best-so-far PC and SA on VideoPhy-2 (590 prompts)
Original abstract

Video generation models produce visually compelling results but systematically violate physical commonsense -- on VideoPhy-2, the best model achieves only 32.6% joint accuracy. We identify a specification bottleneck: text prompts are lossy compression of the physical world, omitting the parameters that fully determine dynamics, and no amount of model scaling can recover what was never specified. From this diagnosis we derive three properties that physics conditioning must satisfy -- sufficiency, dynamism, and verifiability -- and show that no existing approach satisfies all three. We present NEWTON, in which video generation is demoted from the system output to one action inside an agent's toolbox: a learned planner orchestrates physics-aware tools (keyframe generation, scientific computation, prompt refinement) to construct rich conditioning, and a verifier closes the loop for iterative re-planning. The planner is the sole trainable component, optimized on-policy via Flow-GRPO inside the live multi-turn loop. On VideoPhy-2, NEWTON improves joint accuracy from 21.4% to 29.7% on LTX-Video and from 30.7% to 37.4% on Veo-3.1, without modifying either generator. Our project page: https://Newton026.github.io/newton

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript presents NEWTON, a system for improving physical commonsense in video generation models. It diagnoses text prompts as a specification bottleneck and introduces an agentic planner that uses physics-aware tools (keyframe generation, scientific computation, prompt refinement) orchestrated in a verifier-driven re-planning loop. The planner is trained on-policy with Flow-GRPO. On the VideoPhy-2 benchmark, it reports joint accuracy improvements from 21.4% to 29.7% for LTX-Video and 30.7% to 37.4% for Veo-3.1, without altering the underlying video generators.

Significance. If the experimental results hold under proper controls, this work offers a promising direction for addressing physical inaccuracies in video generation by shifting from direct generation to agentic planning with verifiable tools. A notable strength is the decision to optimize only the planner while leaving the generators unmodified, along with the on-policy optimization inside the live multi-turn loop. This could have broader implications for grounded AI systems in vision and robotics.

major comments (2)
  1. [§5 (Experiments)] The headline accuracy gains on VideoPhy-2 (21.4% → 29.7% joint accuracy on LTX-Video; 30.7% → 37.4% on Veo-3.1) are load-bearing for the central claim that the learned planner's tool sequencing produces the improvements. However, the manuscript does not report the mean number of generator invocations per test prompt for NEWTON versus the single-pass baselines, nor does it include a compute-matched baseline that draws the same number of videos and applies the same verifier filter. This leaves open the possibility that gains arise from extra sampling rather than the planner.
  2. [§4.2 (Training)] The on-policy Flow-GRPO optimization is described at a high level, but the manuscript provides no details on how the verifier reward is aggregated across re-planning turns or how the policy handles variable-length trajectories; these choices directly affect whether the reported gains can be attributed to the planner rather than implementation-specific tuning.
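
The compute-matched control requested in the first major comment could look something like the sketch below: draw N videos from the unmodified prompt, keep the one the verifier scores highest, and compare against a single draw. Everything here is a hypothetical stand-in: the generator and verifier are random stubs, N = 4 is arbitrary because the manuscript does not report invocation counts, and counting a video as a joint hit when both SA and PC are at least 4 is an assumed threshold. For simplicity the same scores are used for selection and for evaluation, which a real study would keep separate.

```python
import random

def generate(prompt, rng):
    """Hypothetical stand-in for one call to a frozen video generator."""
    return {"prompt": prompt, "seed": rng.getrandbits(32)}

def verifier_score(video, rng):
    """Hypothetical stand-in for the frozen verifier; returns (SA, PC) in 1..5."""
    return rng.randint(1, 5), rng.randint(1, 5)

def best_of_n(prompt, n, rng):
    """Draw n videos and keep the best-scored one.
    No planner and no tools: only extra sampling plus the verifier filter."""
    scored = []
    for _ in range(n):
        video = generate(prompt, rng)
        sa, pc = verifier_score(video, rng)
        scored.append((min(sa, pc), sa, pc, video))
    return max(scored, key=lambda s: s[0])

def joint_accuracy(picks, threshold=4):
    """Fraction of picked videos with both SA and PC at or above the threshold."""
    hits = sum(1 for _, sa, pc, _ in picks if sa >= threshold and pc >= threshold)
    return hits / len(picks)

rng = random.Random(0)
prompts = [f"prompt-{i}" for i in range(200)]
single = [best_of_n(p, 1, rng) for p in prompts]   # single-pass baseline
matched = [best_of_n(p, 4, rng) for p in prompts]  # same budget as a 4-cycle loop

print(f"single draw: {joint_accuracy(single):.2f}")
print(f"best of 4:   {joint_accuracy(matched):.2f}")
```

Even with purely random scores, best-of-4 selection lifts the measured joint accuracy well above the single-draw figure, which is why the referee asks for this control before attributing gains to the planner.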
minor comments (2)
  1. [§5 (Results)] Tables in §5 lack error bars or standard deviations across multiple runs, which would help assess the statistical reliability of the accuracy improvements.
  2. [§2 (Related Work)] The related work section would benefit from a more explicit comparison to prior agentic frameworks in vision that also use verifiers or tool orchestration.
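
As a rough guide to the scale of uncertainty behind the first minor comment, the sketch below computes a binomial standard error for the reported NEWTON accuracies. It assumes the test set is the 590 prompts mentioned in the Figure 8 caption and treats prompts as independent trials; it covers sampling variation over prompts only, not the run-to-run variance the referee asks the authors to report.

```python
import math

def binomial_se(p, n):
    """Standard error of a proportion p estimated from n independent trials."""
    return math.sqrt(p * (1 - p) / n)

N_PROMPTS = 590  # assumption: taken from the Figure 8 caption

reported = [("LTX-Video + NEWTON", 0.297), ("Veo-3.1 + NEWTON", 0.374)]
for name, acc in reported:
    half_width = 1.96 * binomial_se(acc, N_PROMPTS)  # 95% normal approximation
    print(f"{name}: {100 * acc:.1f}% +/- {100 * half_width:.1f} points")
```

Under these assumptions each interval is a little under four points wide on either side. A paired per-prompt comparison would be tighter than comparing two independent proportions.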

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and insightful comments on our manuscript. We address each major comment below and indicate the revisions we will make to strengthen the paper.

Point-by-point responses
  1. Referee: [§5 (Experiments)] The headline accuracy gains on VideoPhy-2 (21.4% → 29.7% joint accuracy on LTX-Video; 30.7% → 37.4% on Veo-3.1) are load-bearing for the central claim that the learned planner's tool sequencing produces the improvements. However, the manuscript does not report the mean number of generator invocations per test prompt for NEWTON versus the single-pass baselines, nor does it include a compute-matched baseline that draws the same number of videos and applies the same verifier filter. This leaves open the possibility that gains arise from extra sampling rather than the planner.

    Authors: We agree that this is a valid concern and that additional controls would help isolate the contribution of the planner. The current manuscript focuses on end-to-end accuracy but does not report invocation counts or a compute-matched baseline. In the revised version we will add the mean number of generator invocations per test prompt for NEWTON (and note that baselines use one invocation). We will also include results from a compute-matched baseline that generates the same average number of videos per prompt and applies the verifier filter, allowing direct comparison to the planner-driven approach. revision: yes

  2. Referee: [§4.2 (Training)] The on-policy Flow-GRPO optimization is described at a high level, but the manuscript provides no details on how the verifier reward is aggregated across re-planning turns or how the policy handles variable-length trajectories; these choices directly affect whether the reported gains can be attributed to the planner rather than implementation-specific tuning.

    Authors: We acknowledge that the training description in §4.2 is high-level and that more implementation details are needed for reproducibility and attribution. In the revised manuscript we will expand this section to specify how the verifier reward is aggregated over re-planning turns and how the policy manages variable-length trajectories. revision: yes
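
The source does not say how NEWTON aggregates verifier reward across turns. One option common in GRPO-style training, shown here purely as an assumption, is to assign each trajectory a single outcome reward, normalize it within a group of rollouts for the same prompt, and broadcast the resulting advantage to every turn, which handles variable-length trajectories without per-turn rewards. The function names and the example numbers are illustrative.

```python
import statistics

def group_advantages(rewards, eps=1e-6):
    """Group-relative advantage: (r - group mean) / (group std + eps)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

def broadcast_to_turns(advantages, turns_per_trajectory):
    """Give every turn of a trajectory that trajectory's advantage."""
    return [[a] * t for a, t in zip(advantages, turns_per_trajectory)]

# Four rollouts for one prompt: hypothetical outcome rewards and turn counts.
rewards = [1.0, 0.0, 0.0, 1.0]
turns = [2, 3, 1, 3]

advantages = group_advantages(rewards)
per_turn = broadcast_to_turns(advantages, turns)
for reward, turn_advantages in zip(rewards, per_turn):
    print(reward, [round(a, 3) for a in turn_advantages])
```

Alternatives such as using the best or the final verifier score per trajectory, or discounting later turns, would change which tool sequences the planner is rewarded for; that is the detail the referee asks the authors to specify.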

Circularity Check

0 steps flagged

Derivation chain independent of inputs; empirical gains on external benchmark

The paper starts from a conceptual diagnosis of a specification bottleneck in text prompts, derives three required properties (sufficiency, dynamism, verifiability), and constructs NEWTON as an agentic system whose planner is the only trainable part. The central result is an empirical lift in joint accuracy on the independent VideoPhy-2 benchmark (21.4%→29.7% on LTX-Video; 30.7%→37.4% on Veo-3.1) against unmodified generators. On-policy Flow-GRPO optimization occurs inside the live loop, yet the reported metric is not defined by the planner's loss or by construction from the training signals. No equations, self-citations, or renamings reduce the claimed improvement to a tautology or to extra generator calls by definition. The system is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on the diagnosis that text prompts are lossy and that no existing method satisfies sufficiency, dynamism, and verifiability; these are treated as domain assumptions rather than derived results.

axioms (2)
  • domain assumption: Text prompts are a lossy compression of the physical world, omitting the parameters that fully determine dynamics.
    Stated directly in the abstract as the specification bottleneck.
  • domain assumption: Physics conditioning must satisfy sufficiency, dynamism, and verifiability.
    Derived from the diagnosis and used to evaluate prior approaches.

pith-pipeline@v0.9.0 · 5786 in / 1115 out tokens · 26883 ms · 2026-05-20T10:52:31.660492+00:00 · methodology
