NEWTON: Agentic Planning for Physically Grounded Video Generation
Pith reviewed 2026-05-20 10:52 UTC · model grok-4.3
The pith
An agentic planner improves physical accuracy in video generation by sequencing physics tools and verifying results instead of depending on text prompts alone.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We present NEWTON, in which video generation is demoted from the system output to one action inside an agent's toolbox: a learned planner orchestrates physics-aware tools (keyframe generation, scientific computation, prompt refinement) to construct rich conditioning, and a verifier closes the loop for iterative re-planning. The planner is the sole trainable component, optimized on-policy via Flow-GRPO inside the live multi-turn loop.
What carries the argument
The NEWTON learned planner, which selects and sequences physics-aware tools inside a verifier-driven re-planning loop to produce conditioning that is sufficient, dynamic, and verifiable.
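The plan, act, verify loop described here can be sketched in a few lines. This is a toy illustration under stated assumptions, not the paper's implementation: the `State` container, the tool bodies, `toy_planner` (a hand-written rule standing in for the learned policy), `toy_verifier`, and the `max_turns` budget are all hypothetical. Only the three tool categories and the idea that video generation is one action among several come from the paper.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class State:
    prompt: str
    conditioning: Dict[str, str] = field(default_factory=dict)
    video: Optional[str] = None
    history: List[str] = field(default_factory=list)


# Hypothetical tools: each one adds a piece of conditioning to the state.
def scientific_computation(s: State) -> None:
    s.conditioning["trajectory"] = "computed parabola"


def keyframe_generation(s: State) -> None:
    s.conditioning["keyframes"] = "start, apex, landing"


def prompt_refinement(s: State) -> None:
    s.conditioning["prompt"] = s.prompt + " (explicit physical parameters)"


def generate_video(s: State) -> None:
    # Video generation is just one more action in the toolbox.
    s.video = "video conditioned on " + ", ".join(sorted(s.conditioning))


TOOLS: Dict[str, Callable[[State], None]] = {
    "scientific_computation": scientific_computation,
    "keyframe_generation": keyframe_generation,
    "prompt_refinement": prompt_refinement,
    "generate_video": generate_video,
}


def toy_planner(s: State) -> str:
    # Stand-in for the learned policy: use each tool once, then generate.
    for name in ("scientific_computation", "keyframe_generation", "prompt_refinement"):
        if name not in s.history:
            return name
    return "generate_video"


def toy_verifier(s: State) -> bool:
    # Stand-in check: accept only videos built on a computed trajectory.
    return s.video is not None and "trajectory" in s.conditioning


def run_episode(prompt: str, planner=toy_planner, verifier=toy_verifier,
                max_turns: int = 8) -> State:
    s = State(prompt)
    for _ in range(max_turns):
        action = planner(s)
        TOOLS[action](s)
        s.history.append(action)
        if action == "generate_video":
            if verifier(s):
                break          # accepted: stop the loop
            s.video = None     # rejected: return to planning
    return s


final = run_episode("a ball is thrown across a field")
print(final.history)
```

In the real system the planner is the only trained component, so everything except `toy_planner` would stay fixed during optimization.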
If this is right
- Joint accuracy on VideoPhy-2 rises from 21.4 percent to 29.7 percent for LTX-Video and from 30.7 percent to 37.4 percent for Veo-3.1.
- These gains are obtained without any modification to the underlying video generators.
- The three required properties for physics conditioning—sufficiency, dynamism, and verifiability—are satisfied through tool orchestration rather than prompt engineering alone.
- The planner becomes the only component that needs training, keeping the rest of the system fixed.
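The reported numbers can be restated as absolute and relative gains. The short calculation below uses only the four accuracies quoted above; the helper name `gains` is an illustrative choice.

```python
# Joint accuracy on VideoPhy-2 (percent), as reported: (baseline, with NEWTON).
results = {
    "LTX-Video": (21.4, 29.7),
    "Veo-3.1": (30.7, 37.4),
}


def gains(baseline: float, improved: float) -> tuple:
    """Return (absolute gain in points, relative gain in percent)."""
    absolute = improved - baseline
    relative = 100.0 * absolute / baseline
    return round(absolute, 1), round(relative, 1)


for model, (baseline, improved) in results.items():
    absolute, relative = gains(baseline, improved)
    print(f"{model}: +{absolute} points, +{relative}% relative")
# LTX-Video: +8.3 points, +38.8% relative
# Veo-3.1: +6.7 points, +21.8% relative
```

The weaker generator gains more in both absolute and relative terms, although both systems remain below 40 percent joint accuracy.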
Where Pith is reading between the lines
- The same planner-plus-verifier structure could be tested on other generative tasks that require external physical constraints, such as 3D scene synthesis or action sequence planning.
- The approach suggests that separating specification and verification from the core generator may be more scalable than attempting to embed all physics inside a single model.
- On-policy optimization inside the live loop may transfer to other agentic systems where the cost of poor tool choices is high.
Load-bearing premise
The planner can reliably choose and order the correct tools so that verifier feedback produces measurable gains in physical accuracy.
What would settle it
Replace the learned planner with random tool selection and measure whether the reported accuracy gains on VideoPhy-2 disappear for LTX-Video and Veo-3.1.
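One way such a control could be set up is sketched below. It is a hypothetical harness, not something described in the paper: the tool names, the `rollout` helper, and the `max_turns` budget are assumptions. The point is that the random policy receives the same action space and the same turn budget as the learned planner, so only the selection rule differs.

```python
import random
from typing import Callable, List

# Hypothetical action space mirroring the tool categories named in the paper.
TOOL_NAMES = [
    "keyframe_generation",
    "scientific_computation",
    "prompt_refinement",
    "generate_video",
]


def random_planner(rng: random.Random, history: List[str]) -> str:
    # Control policy: ignores the history and picks uniformly at random.
    return rng.choice(TOOL_NAMES)


def rollout(planner: Callable[[random.Random, List[str]], str],
            rng: random.Random, max_turns: int = 6) -> List[str]:
    """Run one episode; every episode ends with exactly one generation call."""
    history: List[str] = []
    while len(history) < max_turns - 1:
        action = planner(rng, history)
        history.append(action)
        if action == "generate_video":
            return history
    history.append("generate_video")  # force generation on the last turn
    return history


rng = random.Random(0)
episode = rollout(random_planner, rng)
print(episode)
```

Scoring the videos from these rollouts with the same verifier and benchmark metric would show how much of the gain depends on the learned selection rule.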
Original abstract
Video generation models produce visually compelling results but systematically violate physical commonsense -- on VideoPhy-2, the best model achieves only 32.6% joint accuracy. We identify a specification bottleneck: text prompts are lossy compression of the physical world, omitting the parameters that fully determine dynamics, and no amount of model scaling can recover what was never specified. From this diagnosis we derive three properties that physics conditioning must satisfy -- sufficiency, dynamism, and verifiability -- and show that no existing approach satisfies all three. We present NEWTON, in which video generation is demoted from the system output to one action inside an agent's toolbox: a learned planner orchestrates physics-aware tools (keyframe generation, scientific computation, prompt refinement) to construct rich conditioning, and a verifier closes the loop for iterative re-planning. The planner is the sole trainable component, optimized on-policy via Flow-GRPO inside the live multi-turn loop. On VideoPhy-2, NEWTON improves joint accuracy from 21.4% to 29.7% on LTX-Video and from 30.7% to 37.4% on Veo-3.1, without modifying either generator. Our project page: https://Newton026.github.io/newton
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents NEWTON, a system for improving physical commonsense in video generation models. It diagnoses text prompts as a specification bottleneck and introduces an agentic planner that uses physics-aware tools (keyframe generation, scientific computation, prompt refinement) orchestrated in a verifier-driven re-planning loop. The planner is trained on-policy with Flow-GRPO. On the VideoPhy-2 benchmark, it reports joint accuracy improvements from 21.4% to 29.7% for LTX-Video and 30.7% to 37.4% for Veo-3.1, without altering the underlying video generators.
Significance. If the experimental results hold under proper controls, this work offers a promising direction for addressing physical inaccuracies in video generation by shifting from direct generation to agentic planning with verifiable tools. A notable strength is the decision to optimize only the planner while leaving the generators unmodified, along with the on-policy optimization inside the live multi-turn loop. This could have broader implications for grounded AI systems in vision and robotics.
major comments (2)
- [§5 (Experiments)] The headline accuracy gains on VideoPhy-2 (21.4% → 29.7% joint accuracy on LTX-Video; 30.7% → 37.4% on Veo-3.1) are load-bearing for the central claim that the learned planner's tool sequencing produces the improvements. However, the manuscript does not report the mean number of generator invocations per test prompt for NEWTON versus the single-pass baselines, nor does it include a compute-matched baseline that draws the same number of videos and applies the same verifier filter. This leaves open the possibility that gains arise from extra sampling rather than the planner.
- [§4.2 (Training)] The on-policy Flow-GRPO optimization is described at a high level, but the manuscript provides no details on how the verifier reward is aggregated across re-planning turns or how the policy handles variable-length trajectories; these choices directly affect whether the reported gains can be attributed to the planner rather than implementation-specific tuning.
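The compute-matched control requested in the first comment amounts to best-of-N sampling with the verifier as the selector. The simulation below is a toy sketch of why that control matters: `toy_generate` and `toy_verifier_score` are hypothetical stand-ins (a uniform random number plays the role of a video's physical plausibility), and N = 4 is an arbitrary choice, since the paper's actual invocation counts are not reported.

```python
import random
from statistics import mean


def toy_generate(prompt: str, rng: random.Random) -> float:
    # Hypothetical generator: the returned number stands in for a video
    # and doubles as its (unknown to the generator) plausibility.
    return rng.random()


def toy_verifier_score(video: float) -> float:
    # Hypothetical verifier that scores plausibility perfectly.
    return video


def best_of_n(prompt: str, n: int, rng: random.Random) -> float:
    """Draw n candidates and keep the one the verifier prefers."""
    candidates = [toy_generate(prompt, rng) for _ in range(n)]
    return max(candidates, key=toy_verifier_score)


def mean_selected_score(n: int, num_prompts: int = 2000, seed: int = 0) -> float:
    rng = random.Random(seed)
    return mean([best_of_n(f"prompt {i}", n, rng) for i in range(num_prompts)])


single_pass = mean_selected_score(n=1)      # one generator call per prompt
compute_matched = mean_selected_score(n=4)  # same budget as a 4-call agent
print(f"single pass: {single_pass:.3f}, best of 4: {compute_matched:.3f}")
```

Even with no planner at all, selection over four samples raises the mean score in this toy setting, which is why a planner-driven gain needs to be compared against this baseline rather than against a single pass.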
minor comments (2)
- [§5 (Results)] Tables in §5 lack error bars or standard deviations across multiple runs, which would help assess the statistical reliability of the accuracy improvements.
- [§2 (Related Work)] The related work section would benefit from a more explicit comparison to prior agentic frameworks in vision that also use verifiers or tool orchestration.
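As a rough guide to the uncertainty that the first minor comment asks about, a normal-approximation interval for a single accuracy figure can be computed as below. The test-set size `n = 1000` is a placeholder assumption, not the benchmark's actual size, and this binomial interval only captures sampling noise over prompts; it does not replace standard deviations across repeated runs.

```python
import math


def accuracy_interval(accuracy: float, n: int, z: float = 1.96) -> tuple:
    """Normal-approximation 95% interval for a proportion measured on n items."""
    se = math.sqrt(accuracy * (1.0 - accuracy) / n)
    return accuracy - z * se, accuracy + z * se


n = 1000  # placeholder test-set size (assumption)
for label, acc in [("LTX-Video baseline", 0.214), ("LTX-Video + NEWTON", 0.297)]:
    low, high = accuracy_interval(acc, n)
    print(f"{label}: {acc:.3f} [{low:.3f}, {high:.3f}]")
```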
Simulated Author's Rebuttal
We thank the referee for their constructive and insightful comments on our manuscript. We address each major comment below and indicate the revisions we will make to strengthen the paper.
- Referee: [§5 (Experiments)] The headline accuracy gains on VideoPhy-2 (21.4% → 29.7% joint accuracy on LTX-Video; 30.7% → 37.4% on Veo-3.1) are load-bearing for the central claim that the learned planner's tool sequencing produces the improvements. However, the manuscript does not report the mean number of generator invocations per test prompt for NEWTON versus the single-pass baselines, nor does it include a compute-matched baseline that draws the same number of videos and applies the same verifier filter. This leaves open the possibility that gains arise from extra sampling rather than the planner.
  Authors: We agree that this is a valid concern and that additional controls would help isolate the contribution of the planner. The current manuscript focuses on end-to-end accuracy but does not report invocation counts or a compute-matched baseline. In the revised version we will add the mean number of generator invocations per test prompt for NEWTON (and note that baselines use one invocation). We will also include results from a compute-matched baseline that generates the same average number of videos per prompt and applies the verifier filter, allowing direct comparison to the planner-driven approach.
  Revision: yes
- Referee: [§4.2 (Training)] The on-policy Flow-GRPO optimization is described at a high level, but the manuscript provides no details on how the verifier reward is aggregated across re-planning turns or how the policy handles variable-length trajectories; these choices directly affect whether the reported gains can be attributed to the planner rather than implementation-specific tuning.
  Authors: We acknowledge that the training description in §4.2 is high-level and that more implementation details are needed for reproducibility and attribution. In the revised manuscript we will expand this section to specify how the verifier reward is aggregated over re-planning turns and how the policy manages variable-length trajectories.
  Revision: yes
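To make the open questions concrete, the sketch below shows one possible answer, assumed purely for illustration and not confirmed by the manuscript: each trajectory receives a single terminal verifier reward, rewards are normalized within a group of rollouts for the same prompt (the standard group-relative advantage used by GRPO-style methods), and the resulting advantage is broadcast to every turn so that trajectories of different lengths need no padding. The function names and the example rewards are hypothetical.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Normalize terminal rewards within one group of rollouts for a prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def broadcast_to_turns(advantages: List[float], num_turns: List[int]) -> List[List[float]]:
    """Assumed aggregation: every turn in a trajectory shares its advantage."""
    return [[a] * t for a, t in zip(advantages, num_turns)]


# Hypothetical group of four rollouts: verifier pass/fail and turn counts.
rewards = [1.0, 0.0, 0.0, 1.0]
num_turns = [2, 5, 3, 1]

advantages = group_relative_advantages(rewards)
per_turn = broadcast_to_turns(advantages, num_turns)
for adv, turns in zip(advantages, per_turn):
    print(f"advantage {adv:+.2f} applied to {len(turns)} turn(s)")
```

Alternatives such as per-turn rewards or length-normalized credit would change how strongly long re-planning trajectories are reinforced, which is why the referee asks for the choice to be stated.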
Circularity Check
Derivation chain independent of inputs; empirical gains on external benchmark
The paper starts from a conceptual diagnosis of a specification bottleneck in text prompts, derives three required properties (sufficiency, dynamism, verifiability), and constructs NEWTON as an agentic system whose planner is the only trainable part. The central result is an empirical lift in joint accuracy on the independent VideoPhy-2 benchmark (21.4%→29.7% on LTX-Video; 30.7%→37.4% on Veo-3.1) against unmodified generators. On-policy Flow-GRPO optimization occurs inside the live loop, yet the reported metric is not defined by the planner's loss or by construction from the training signals. No equations, self-citations, or renamings reduce the claimed improvement to a tautology or to extra generator calls by definition. The system is therefore self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: Text prompts are lossy compression of the physical world, omitting parameters that fully determine dynamics.
- domain assumption: Physics conditioning must satisfy sufficiency, dynamism, and verifiability.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean, washburn_uniqueness_aczel (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: "a learned planner orchestrates physics-aware tools (keyframe generation, scientific computation, prompt refinement) ... optimized on-policy via Flow-GRPO inside the live multi-turn loop"
- IndisputableMonolith/Foundation/DimensionForcing.lean, alexander_duality_circle_linking (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: "Python Computation. Provides a sandboxed Python environment for scientific computation—projectile trajectories, conservation-of-momentum calculations"
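The kind of calculation the quoted passage mentions (projectile trajectories, conservation of momentum) might look like the following inside such a sandbox. This is an illustrative guess at the level of detail involved, not code from the paper: the function names, the drag-free flat-ground model, and the example inputs are all assumptions.

```python
import math

G = 9.81  # gravitational acceleration in m/s^2


def projectile(v0: float, angle_deg: float, g: float = G) -> dict:
    """Drag-free projectile launched from and landing at ground level."""
    theta = math.radians(angle_deg)
    vx, vy = v0 * math.cos(theta), v0 * math.sin(theta)
    time_of_flight = 2.0 * vy / g
    return {
        "time_of_flight": time_of_flight,   # seconds
        "range": vx * time_of_flight,       # metres
        "max_height": vy ** 2 / (2.0 * g),  # metres
    }


def inelastic_collision_velocity(m1: float, v1: float, m2: float, v2: float) -> float:
    """Shared velocity after a perfectly inelastic 1D collision.

    Follows from conservation of momentum: m1*v1 + m2*v2 = (m1 + m2) * v.
    """
    return (m1 * v1 + m2 * v2) / (m1 + m2)


throw = projectile(v0=10.0, angle_deg=45.0)
print({k: round(v, 2) for k, v in throw.items()})
print(inelastic_collision_velocity(m1=1.0, v1=2.0, m2=1.0, v2=0.0))
```

Quantities like the time of flight or apex height are the sort of parameters a text prompt omits and that downstream tools such as keyframe generation could consume.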
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.