
arxiv: 2604.10383 · v1 · submitted 2026-04-11 · 💻 cs.CV

Agentic Video Generation: From Text to Executable Event Graphs via Tool-Constrained LLM Planning

Pith reviewed 2026-05-10 15:18 UTC · model grok-4.3

classification 💻 cs.CV
keywords agentic video generation · event graphs · LLM planning · game engine execution · physical validity · semantic alignment · multi-agent systems · structured video generation

The pith

LLM agents construct executable event graphs that a 3D game engine runs to create videos from text.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that current neural video generators produce outputs that lack reliable physical behavior and story consistency because they generate pixels without explicit constraints. Instead, the authors have an LLM build a Graph of Events in Space and Time (GEST) that fully specifies actors, actions, objects, and temporal rules, then execute that graph deterministically inside a game engine. A two-agent setup, with a Director for overall planning and a Scene Builder for scene construction, uses tool calls to enforce simulator rules at every step, avoiding the failures seen in pure LLM refinement attempts. Human evaluations in seeded tests show the resulting videos score higher on physical validity and semantic alignment than those from VEO 3.1 and WAN 2.2. This approach matters because it supplies an explicit, checkable representation that turns narrative text into grounded, replayable video rather than relying on opaque learned patterns.

Core claim

The central claim is that separating the LLM's narrative reasoning from a programmatic backend that validates all simulator constraints produces GEST specifications that are executable by construction. In direct comparisons, the resulting videos reach 58 percent physical validity and 3.75 out of 5 semantic alignment, against 25 percent and 20 percent validity and 2.33 and 1.50 alignment for the neural baselines VEO 3.1 and WAN 2.2.
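The separation described above can be illustrated with a minimal Python sketch of a state backend that validates every tool call before changing state, so a rejected call never corrupts the specification. Everything here is hypothetical: the `StateBackend` class, the `ToolError` exception, the `call_tool` helper, and the small action catalogue are assumptions for illustration, not the paper's API.

```python
from dataclasses import dataclass, field


class ToolError(Exception):
    """Raised when a tool call would violate a simulator constraint."""


@dataclass
class StateBackend:
    # Hypothetical catalogue: which actions each object type supports.
    supported_actions: dict
    actors: set = field(default_factory=set)
    objects: dict = field(default_factory=dict)  # object name -> object type
    events: list = field(default_factory=list)   # (actor, action, object)

    def add_actor(self, name):
        if name in self.actors:
            raise ToolError(f"actor {name!r} already exists")
        self.actors.add(name)

    def add_object(self, name, obj_type):
        if obj_type not in self.supported_actions:
            raise ToolError(f"unknown object type {obj_type!r}")
        self.objects[name] = obj_type

    def add_event(self, actor, action, obj):
        # All checks run before any mutation, so a rejected call
        # leaves the specification unchanged and still executable.
        if actor not in self.actors:
            raise ToolError(f"unknown actor {actor!r}")
        if obj not in self.objects:
            raise ToolError(f"unknown object {obj!r}")
        allowed = self.supported_actions[self.objects[obj]]
        if action not in allowed:
            raise ToolError(f"{action!r} is not supported on {obj!r}")
        self.events.append((actor, action, obj))


def call_tool(backend, tool, *args):
    """Run one tool call and return a message the planning LLM could read."""
    try:
        getattr(backend, tool)(*args)
        return "ok"
    except ToolError as err:
        return f"rejected: {err}"


backend = StateBackend(
    supported_actions={"chair": {"sit"}, "glass": {"pick_up", "drink"}}
)
results = [
    call_tool(backend, "add_actor", "Marcus"),
    call_tool(backend, "add_object", "glass_1", "glass"),
    call_tool(backend, "add_event", "Marcus", "sit", "glass_1"),    # invalid
    call_tool(backend, "add_event", "Marcus", "drink", "glass_1"),  # valid
]
```

In this sketch the planner only ever sees "ok" or a rejection message, which mirrors the idea that the LLM reasons about the story while the backend owns the constraints.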

What carries the argument

The Graph of Events in Space and Time (GEST), a structured formal specification of actors, actions, objects, and temporal constraints that is populated by a hierarchical Director and Scene Builder agent pair using validated tool calls to a state backend.
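As a rough illustration of what such a specification might look like in code, the Python sketch below models events and "before" constraints and derives an execution order by topological sort. The class names `Event` and `EventGraph`, the single `before` edge type, and the example events are assumptions made for illustration; the paper's actual GEST schema and edge types are not reproduced here.

```python
from dataclasses import dataclass, field
from graphlib import CycleError, TopologicalSorter
from typing import Optional


@dataclass(frozen=True)
class Event:
    event_id: str
    actor: str
    action: str
    target: Optional[str] = None


@dataclass
class EventGraph:
    events: dict = field(default_factory=dict)  # event_id -> Event
    before: set = field(default_factory=set)    # (earlier_id, later_id)

    def add_event(self, event):
        self.events[event.event_id] = event

    def add_before(self, earlier, later):
        if earlier not in self.events or later not in self.events:
            raise KeyError("both events must exist before they can be ordered")
        self.before.add((earlier, later))

    def execution_order(self):
        """Return event ids in an order that respects every 'before' edge."""
        sorter = TopologicalSorter({eid: set() for eid in self.events})
        for earlier, later in self.before:
            sorter.add(later, earlier)  # 'later' depends on 'earlier'
        return list(sorter.static_order())

    def is_consistent(self):
        """Temporal constraints are satisfiable only if they contain no cycle."""
        try:
            self.execution_order()
            return True
        except CycleError:
            return False


graph = EventGraph()
graph.add_event(Event("e1", "Marcus", "pick_up", "drink"))
graph.add_event(Event("e2", "Sarah", "call", "friend"))
graph.add_event(Event("e3", "Marcus", "hug", "Sarah"))
graph.add_before("e1", "e2")
graph.add_before("e2", "e3")
order = graph.execution_order()
```

A cycle check like `is_consistent` is one simple way a backend could refuse contradictory temporal constraints before anything reaches the engine.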

If this is right

  • Agentic outputs win 79 percent of text comparisons and 74 percent of video comparisons against procedural baselines in LLM jury evaluations.
  • Staged LLM refinement alone produces zero executable specifications in fifty attempts, showing the need for tool-enforced state management.
  • Relation Subagents are required to fill logical and semantic edges that procedural methods leave empty, exercising the full capacity of the GEST representation.
  • Tool-validated construction guarantees that every specification is executable, and deterministic engine execution removes the semantic unreliability seen in direct pixel generation systems.
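The jury win rates in the first bullet can be made concrete with a small Python sketch. The source does not say how the three jurors' votes are combined, so majority voting is an assumption here, and the vote data and the labels `agentic` and `procedural` are invented for illustration.

```python
from collections import Counter


def majority_winner(votes):
    """Return the label picked by most jurors.

    With three jurors and two labels a tie cannot occur.
    """
    label, _count = Counter(votes).most_common(1)[0]
    return label


def win_rate(comparisons, system="agentic"):
    """Fraction of pairwise comparisons won by `system` under majority vote."""
    wins = sum(majority_winner(votes) == system for votes in comparisons)
    return wins / len(comparisons)


# Hypothetical votes: one tuple per comparison, one entry per juror.
comparisons = [
    ("agentic", "agentic", "procedural"),
    ("agentic", "agentic", "agentic"),
    ("procedural", "procedural", "agentic"),
    ("agentic", "procedural", "agentic"),
]
rate = win_rate(comparisons)
```

Other aggregation rules, such as averaging per-juror win rates, would give different numbers on the same votes, which is why the aggregation rule matters when reading the 79 and 74 percent figures.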

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The GEST could serve as explicit ground-truth labels to train or fine-tune neural video models on physically consistent data.
  • Users could edit the event graph directly to steer generation without re-prompting the entire system.
  • The separation of planning and execution might extend to other domains that require both narrative flexibility and hard constraint satisfaction, such as robotic task planning.

Load-bearing premise

The 3D game engine can run the full range of complex interactions described in the GEST without adding simulation artifacts that would cancel the physical validity gains.

What would settle it

Generation of a multi-agent scene with precise object interactions where the engine output visibly violates the stated temporal or physical constraints in the GEST, such as incorrect collision responses or missing causal links.

Figures

Figures reproduced from arXiv: 2604.10383 by Marius Leordeanu, Mihai Masala, Nicolae Cudlenco.

Figure 1
Figure 1: System architecture. The Director Agent explores the simulation world and plans the story, delegating individual scenes to the Scene Builder Subagent. Both agents interact through Exploration Tools (read-only) and Building Tools (constraint-enforcing), which operate on the State Backend, the procedural generator repurposed as a tool layer. The agents never manipulate the GEST directly…
Figure 2
Figure 2: Example agentic story. Top: GEST subgraph for Marcus (red) and Sarah (blue) across two scenes. Director’s narrative: “Two old friends reunite [...] Marcus grabs drinks, they chat and laugh, Marcus smokes while Sarah calls a friend, they share a hug.” Middle: Scene 1 frames. Bottom: Scene 2 frames. Post-simulation VDG: “A man answers a call [...] chats with a woman [...] grabs a drink [...] they…
Original abstract

Existing multi-agent video generation systems use LLM agents to orchestrate neural video generators, producing visually impressive but semantically unreliable outputs with no ground truth annotations. We present an agentic system that inverts this paradigm: instead of generating pixels, the LLM constructs a formal Graph of Events in Space and Time (GEST) -- a structured specification of actors, actions, objects, and temporal constraints -- which is then executed deterministically in a 3D game engine. A staged LLM refinement pipeline fails entirely at this task (0 of 50 attempts produce an executable specification), motivating a fundamentally different architecture based on a separation of concerns: the LLM handles narrative planning through natural language reasoning, while a programmatic state backend enforces all simulator constraints through validated tool calls, guaranteeing that every generated specification is executable by construction. The system uses a hierarchical two-agent architecture -- a Director that plans the story and a Scene Builder that constructs individual scenes through a round-based state machine -- with dedicated Relation Subagents that populate the logical and semantic edge types of the GEST formalism that procedural generation leaves empty, making this the first approach to exercise the full expressive capacity of the representation. We evaluate in two stages: autonomous generation against procedural baselines via a 3-model LLM jury, where agentic narratives win 79% of text and 74% of video comparisons; and seeded generation where the same text is given to our system, VEO 3.1, and WAN 2.2, with human annotations showing engine-generated videos substantially outperform neural generators on physical validity (58% vs 25% and 20%) and semantic alignment (3.75/5 vs 2.33 and 1.50).

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces an agentic video generation framework that uses LLMs to construct a formal Graph of Events in Space and Time (GEST) specification of actors, actions, objects, and temporal constraints, which is then executed deterministically in a 3D game engine rather than synthesizing pixels directly. It describes a hierarchical architecture with a Director for narrative planning, a Scene Builder using a round-based state machine, and dedicated Relation Subagents to populate logical and semantic relations. The system is motivated by the failure of staged LLM refinement (0/50 executable outputs) and evaluated in two stages: autonomous generation against procedural baselines via a 3-model LLM jury (79% text and 74% video wins) and seeded generation against VEO 3.1 and WAN 2.2, where human annotations show engine outputs scoring 58% physical validity (vs. 25% and 20%) and 3.75/5 semantic alignment (vs. 2.33 and 1.50).

Significance. If the empirical comparisons hold under rigorous scrutiny, the work offers a substantive alternative to neural video generators by enforcing executability and physical constraints through symbolic planning and simulation. The separation of LLM-based narrative reasoning from programmatic constraint enforcement, along with the use of a full GEST representation, provides a falsifiable and reproducible pathway for complex scene generation that could improve reliability in applications requiring semantic and physical consistency.

major comments (3)
  1. [Evaluation] Evaluation section (human annotation results): The central claim of substantial outperformance on physical validity (58% vs. 25%/20%) and semantic alignment (3.75/5 vs. 2.33/1.50) is load-bearing, yet the manuscript provides no sample size, inter-annotator agreement, definition of physical validity criteria, or controls for prompt engineering and annotator bias. This directly affects the strength of support for the seeded-generation comparison to VEO 3.1 and WAN 2.2.
  2. [Method] Method section (Scene Builder and Relation Subagents): The architecture guarantees syntactic executability via tool calls and the programmatic state backend, but contains no quantitative audit or coverage analysis of simulation artifacts (e.g., collision resolution, secondary effects, or fidelity loss in multi-actor scenes) that the 3D engine may introduce when executing complex GEST specifications. Such artifacts would be invisible to the 'executable by construction' guarantee yet would undermine the reported physical-validity advantage.
  3. [Abstract and Evaluation] Abstract and Evaluation: The staged LLM refinement baseline is reported as failing in 0 of 50 attempts, but the manuscript does not detail the exact prompting strategy, failure modes, or how this baseline was constructed, making it difficult to assess whether the proposed hierarchical architecture's advantages are fairly isolated from prompt-engineering effects.
minor comments (2)
  1. [Introduction] The GEST formalism is referenced throughout but lacks an early formal definition, diagram, or edge-type enumeration that would clarify how the Relation Subagents populate the representation.
  2. [Evaluation] Figure captions and table headers could more explicitly link reported percentages to the exact comparison conditions (autonomous vs. seeded) to avoid reader confusion.
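Major comments 1 and 3 both turn on how much a reported proportion can support without further detail. As a hedged illustration, the Python sketch below computes a Wilson score interval for a binomial proportion. The 0 of 50 count comes from the abstract; the counts 7 of 12 and 70 of 120 are purely hypothetical, since no sample size is reported for the human study, and they are chosen only to show how the interval narrows as the sample grows.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is about 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    if not 0 <= successes <= n:
        raise ValueError("successes must lie between 0 and n")
    p = successes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return max(0.0, centre - half), min(1.0, centre + half)


# Reported in the abstract: 0 executable specifications in 50 attempts.
low_0, high_0 = wilson_interval(0, 50)

# Hypothetical sample sizes for a proportion near 58 percent.
small_low, small_high = wilson_interval(7, 12)
large_low, large_high = wilson_interval(70, 120)
small_width = small_high - small_low
large_width = large_high - large_low
```

The same observed rate is compatible with a much wider range of true rates at the smaller sample size, which is why the sample size is needed to judge the 58 percent versus 25 and 20 percent comparison.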

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. The comments identify key areas where additional transparency and analysis will strengthen the manuscript. We address each major comment below and commit to revisions that improve the rigor of the evaluation and method sections without altering the core claims.

Point-by-point responses
  1. Referee: [Evaluation] Evaluation section (human annotation results): The central claim of substantial outperformance on physical validity (58% vs. 25%/20%) and semantic alignment (3.75/5 vs. 2.33/1.50) is load-bearing, yet the manuscript provides no sample size, inter-annotator agreement, definition of physical validity criteria, or controls for prompt engineering and annotator bias. This directly affects the strength of support for the seeded-generation comparison to VEO 3.1 and WAN 2.2.

    Authors: We agree that these details are necessary to fully support the reported results. In the revised manuscript we will expand the Evaluation section with the sample size used for human annotations, inter-annotator agreement statistics, the explicit criteria employed for physical validity judgments, and a description of controls including randomized video presentation order and blinded annotation procedures. These additions will be placed in a dedicated subsection on the human study protocol. revision: yes

  2. Referee: [Method] Method section (Scene Builder and Relation Subagents): The architecture guarantees syntactic executability via tool calls and the programmatic state backend, but contains no quantitative audit or coverage analysis of simulation artifacts (e.g., collision resolution, secondary effects, or fidelity loss in multi-actor scenes) that the 3D engine may introduce when executing complex GEST specifications. Such artifacts would be invisible to the 'executable by construction' guarantee yet would undermine the reported physical-validity advantage.

    Authors: This observation is correct; executability does not automatically preclude simulation-level artifacts. We will add a quantitative audit subsection to the Method section that reports coverage statistics on collision resolutions, secondary physics effects, and any observed fidelity loss across the evaluated multi-actor scenes. This analysis will be tied directly to the human physical-validity annotations to clarify the contribution of the GEST representation versus engine behavior. revision: yes

  3. Referee: [Abstract and Evaluation] Abstract and Evaluation: The staged LLM refinement baseline is reported as failing in 0 of 50 attempts, but the manuscript does not detail the exact prompting strategy, failure modes, or how this baseline was constructed, making it difficult to assess whether the proposed hierarchical architecture's advantages are fairly isolated from prompt-engineering effects.

    Authors: We acknowledge that greater detail on the baseline construction is required for fair comparison. The revised manuscript will expand both the Abstract and Evaluation sections to describe the precise prompting strategy used for staged refinement, a categorization of the observed failure modes across the 50 attempts, and the sampling procedure that aligned the baseline prompts with those used for the hierarchical system. This will help isolate the benefits of the Director-Scene Builder separation from prompt-engineering variations. revision: yes

Circularity Check

0 steps flagged

No significant circularity; claims rest on external empirical comparisons

Full rationale

The paper's load-bearing results derive from two independent evaluation stages: an LLM jury comparing agentic outputs to procedural baselines (79%/74% win rates) and human annotations on seeded generation against VEO 3.1 and WAN 2.2 (58% vs 25%/20% physical validity; 3.75 vs 2.33/1.50 semantic alignment). These metrics are measured externally and do not reduce to quantities defined inside the system. The 'executable by construction' property follows directly from the programmatic state backend and validated tool calls, which is an architectural design choice rather than a self-referential definition of the performance claims. No equations, fitted parameters, or self-citation chains are invoked to derive the reported advantages. The derivation chain is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 1 invented entity

The central claim depends on the game engine providing sufficient physical fidelity for GEST-specified events and on the LLM reliably generating valid tool calls within the enforced constraints; no free parameters or mathematical axioms are invoked.

invented entities (1)
  • GEST (Graph of Events in Space and Time) · no independent evidence
    purpose: Formal structured specification of actors, actions, objects, and temporal constraints that the LLM constructs and the engine executes deterministically
    Core representation introduced to guarantee executability and enable full expressive capacity through subagent population of edge types.



Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. NEWTON: Agentic Planning for Physically Grounded Video Generation

    cs.CV 2026-05 unverdicted novelty 6.0

    NEWTON improves physical accuracy in video generation by deploying a trainable planner that coordinates physics-aware tools and a verifier, raising joint accuracy on VideoPhy-2 without altering the base generators.
