Agentic Video Generation: From Text to Executable Event Graphs via Tool-Constrained LLM Planning

Marius Leordeanu; Mihai Masala; Nicolae Cudlenco

arxiv: 2604.10383 · v1 · submitted 2026-04-11 · 💻 cs.CV

Pith reviewed 2026-05-10 15:18 UTC · model grok-4.3

keywords agentic video generation · event graphs · LLM planning · game engine execution · physical validity · semantic alignment · multi-agent systems · structured video generation

The pith

LLM agents construct executable event graphs that a 3D game engine runs to create videos from text.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that current neural video generators produce outputs that lack reliable physical behavior and story consistency because they generate pixels without explicit constraints. Instead, the authors have an LLM build a Graph of Events in Space and Time that fully specifies actors, actions, objects, and temporal rules, then execute that graph deterministically inside a game engine. A two-agent setup with a Director for overall planning and a Scene Builder for scene construction uses tool calls to enforce simulator rules at every step, avoiding the failures seen in pure LLM refinement attempts. Human evaluations in seeded tests show the resulting videos score higher on physical validity and semantic alignment than those from VEO 3.1 and WAN 2.2. This approach matters because it supplies an explicit, checkable representation that turns narrative text into grounded, replayable video rather than relying on opaque learned patterns.

Core claim

The central claim is that separating the LLM's narrative reasoning from a programmatic backend that validates all simulator constraints produces GEST specifications that are executable by construction. In the seeded comparisons, the resulting videos reach 58 percent physical validity and 3.75 out of 5 semantic alignment, against 25 and 20 percent validity and 2.33 and 1.50 alignment for the two neural baselines.
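The separation claimed here can be illustrated with a minimal Python sketch. Everything in it is hypothetical: `StateBackend`, `add_actor`, `add_action`, and the toy affordance table are stand-ins invented for illustration, not the paper's API. The point is only that the planner never writes the specification itself. It proposes tool calls, and the backend applies or rejects each one, so whatever state accumulates is valid by construction.

```python
from dataclasses import dataclass, field


class ToolError(Exception):
    """Raised when a proposed tool call violates a backend rule."""


@dataclass
class StateBackend:
    # Hypothetical world description: which actions each object supports.
    affordances: dict[str, set[str]]
    actors: set[str] = field(default_factory=set)
    events: list[tuple[str, str, str]] = field(default_factory=list)

    def add_actor(self, name: str) -> None:
        if name in self.actors:
            raise ToolError(f"actor {name!r} already exists")
        self.actors.add(name)

    def add_action(self, actor: str, action: str, obj: str) -> None:
        # Validate first; mutate state only if every check passes.
        if actor not in self.actors:
            raise ToolError(f"unknown actor {actor!r}")
        if action not in self.affordances.get(obj, set()):
            raise ToolError(f"{obj!r} does not support {action!r}")
        self.events.append((actor, action, obj))


def run_plan(backend, calls):
    """Apply proposed calls; return the error messages fed back to the planner."""
    feedback = []
    for tool, args in calls:
        try:
            getattr(backend, tool)(*args)
        except ToolError as err:
            feedback.append(str(err))
    return feedback


backend = StateBackend(affordances={"chair": {"sit"}, "phone": {"answer"}})
proposed = [
    ("add_actor", ("Marcus",)),
    ("add_action", ("Marcus", "sit", "chair")),
    ("add_action", ("Marcus", "drink", "chair")),  # rejected: no such affordance
    ("add_action", ("Sarah", "answer", "phone")),  # rejected: actor never added
]
errors = run_plan(backend, proposed)
print(backend.events)
print(errors)
```

In this toy version a rejected call leaves the state untouched and returns a message the planner could react to, which is the property a free-text refinement loop lacks.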

What carries the argument

The Graph of Events in Space and Time (GEST), a structured formal specification of actors, actions, objects, and temporal constraints that is populated by a hierarchical Director and Scene Builder agent pair using validated tool calls to a state backend.
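To make the representation concrete, here is a small Python sketch of an event graph with typed edges. It is an illustration under assumptions, not the GEST schema from the paper: the `Event` and `Gest` classes, their field names, and the edge labels `"temporal"`, `"logical"`, and `"semantic"` are all hypothetical. The example events are loosely based on the Figure 2 narrative.

```python
from dataclasses import dataclass, field
from graphlib import TopologicalSorter
from typing import Optional


@dataclass(frozen=True)
class Event:
    id: str
    actor: str
    action: str
    target: Optional[str] = None  # object or other actor, if any


@dataclass
class Gest:
    events: dict = field(default_factory=dict)  # event id -> Event
    edges: list = field(default_factory=list)   # (src id, dst id, kind)

    def add_event(self, event: Event) -> None:
        self.events[event.id] = event

    def add_edge(self, src: str, dst: str, kind: str) -> None:
        # kind is one of "temporal", "logical", "semantic" (assumed labels)
        if src not in self.events or dst not in self.events:
            raise KeyError("edge endpoints must be existing events")
        self.edges.append((src, dst, kind))

    def execution_order(self) -> list:
        """Order events so that every temporal edge src -> dst runs src first."""
        preds = {eid: set() for eid in self.events}
        for src, dst, kind in self.edges:
            if kind == "temporal":
                preds[dst].add(src)
        # Raises graphlib.CycleError if the temporal edges contradict each other.
        return list(TopologicalSorter(preds).static_order())


gest = Gest()
gest.add_event(Event("e1", "Marcus", "grab", "drink"))
gest.add_event(Event("e2", "Marcus", "chat", "Sarah"))
gest.add_event(Event("e3", "Marcus", "hug", "Sarah"))
gest.add_edge("e1", "e2", "temporal")
gest.add_edge("e2", "e3", "temporal")
gest.add_edge("e2", "e3", "semantic")  # e.g. the chat motivates the hug
order = gest.execution_order()
print(order)
```

Only the temporal edges constrain ordering in this sketch. The logical and semantic edges are carried as annotations, which is one plausible reading of why procedural generation can leave them empty without breaking execution.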

If this is right

Agentic outputs win 79 percent of text comparisons and 74 percent of video comparisons against procedural baselines in LLM jury evaluations.
Staged LLM refinement alone produces zero executable specifications in fifty attempts, showing the need for tool-enforced state management.
Relation Subagents are required to fill logical and semantic edges that procedural methods leave empty, exercising the full capacity of the GEST representation.
Because the state backend validates every tool call, each specification is executable by construction, and deterministic engine execution removes the semantic unreliability seen in direct pixel generation systems.
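The jury win rates quoted above can be read as the share of pairwise comparisons in which the jury prefers the agentic output. The sketch below assumes a simple majority vote across the three jurors. That aggregation rule is an assumption made here for illustration, and the votes are made-up data, not results from the paper.

```python
from collections import Counter


def majority(votes):
    """Return the label chosen by most jurors, or 'tie' if there is no unique winner."""
    ranked = Counter(votes).most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return "tie"
    return ranked[0][0]


def win_rate(comparisons, label="agentic"):
    """Fraction of comparisons whose majority verdict equals `label`."""
    verdicts = [majority(v) for v in comparisons]
    return sum(v == label for v in verdicts) / len(verdicts)


# Made-up votes from three hypothetical jurors on four comparisons.
comparisons = [
    ["agentic", "agentic", "procedural"],
    ["agentic", "agentic", "agentic"],
    ["procedural", "procedural", "agentic"],
    ["agentic", "procedural", "agentic"],
]
rate = win_rate(comparisons)
print(rate)  # 0.75
```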

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The GEST could serve as explicit ground-truth labels to train or fine-tune neural video models on physically consistent data.
Users could edit the event graph directly to steer generation without re-prompting the entire system.
The separation of planning and execution might extend to other domains that require both narrative flexibility and hard constraint satisfaction, such as robotic task planning.

Load-bearing premise

The 3D game engine can run the full range of complex interactions described in the GEST without adding simulation artifacts that would cancel the physical validity gains.

What would settle it

Generation of a multi-agent scene with precise object interactions where the engine output visibly violates the stated temporal or physical constraints in the GEST, such as incorrect collision responses or missing causal links.
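A check of this kind could be automated by comparing the intervals the engine actually produced with the temporal constraints declared in the graph. The sketch below assumes, purely for illustration, that constraints are expressed as Allen interval relations (the paper cites Allen's work, but this page does not say how GEST encodes time). The event names and timings are invented.

```python
def allen_relation(a, b):
    """Allen relation of interval a to interval b; each is (start, end) with start < end."""
    (a0, a1), (b0, b1) = a, b
    if a1 < b0:
        return "before"
    if a1 == b0:
        return "meets"
    if b1 < a0:
        return "after"
    if b1 == a0:
        return "met-by"
    if a0 == b0 and a1 == b1:
        return "equals"
    if a0 == b0:
        return "starts" if a1 < b1 else "started-by"
    if a1 == b1:
        return "finishes" if a0 > b0 else "finished-by"
    if b0 < a0 and a1 < b1:
        return "during"
    if a0 < b0 and b1 < a1:
        return "contains"
    return "overlaps" if a0 < b0 else "overlapped-by"


def violations(constraints, timeline):
    """List declared constraints that the executed timeline does not satisfy."""
    failed = []
    for first, relation, second in constraints:
        observed = allen_relation(timeline[first], timeline[second])
        if observed != relation:
            failed.append((first, relation, second, observed))
    return failed


# Hypothetical declared constraints and executed timings (seconds).
constraints = [("grab_drink", "before", "chat"), ("smoke", "during", "chat")]
timeline = {"grab_drink": (0.0, 4.0), "chat": (5.0, 20.0), "smoke": (18.0, 24.0)}
failed = violations(constraints, timeline)
print(failed)
```

Here the second constraint fails because the smoking interval runs past the end of the chat, so the observed relation is `overlapped-by` rather than `during`. A check like this covers timing only. Physical artifacts such as bad collision responses would need a separate audit.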

Figures

Figures reproduced from arXiv: 2604.10383 by Marius Leordeanu, Mihai Masala, Nicolae Cudlenco.

**Figure 1.** System architecture. The Director Agent explores the simulation world and plans the story, delegating individual scenes to the Scene Builder Subagent. Both agents interact through Exploration Tools (read-only) and Building Tools (constraint-enforcing), which operate on the State Backend, the procedural generator repurposed as a tool layer. backend (Section 5.5). The agents never manipulate the GEST directly…

**Figure 2.** Example agentic story. Top: GEST subgraph for Marcus (red) and Sarah (blue) across two scenes. Director’s narrative: “Two old friends reunite [...] Marcus grabs drinks, they chat and laugh, Marcus smokes while Sarah calls a friend, they share a hug.” Middle: Scene 1 frames. Bottom: Scene 2 frames. Post-simulation VDG: “A man answers a call [...] chats with a woman [...] grabs a drink [...] they…

Original abstract

Existing multi-agent video generation systems use LLM agents to orchestrate neural video generators, producing visually impressive but semantically unreliable outputs with no ground truth annotations. We present an agentic system that inverts this paradigm: instead of generating pixels, the LLM constructs a formal Graph of Events in Space and Time (GEST) -- a structured specification of actors, actions, objects, and temporal constraints -- which is then executed deterministically in a 3D game engine. A staged LLM refinement pipeline fails entirely at this task (0 of 50 attempts produce an executable specification), motivating a fundamentally different architecture based on a separation of concerns: the LLM handles narrative planning through natural language reasoning, while a programmatic state backend enforces all simulator constraints through validated tool calls, guaranteeing that every generated specification is executable by construction. The system uses a hierarchical two-agent architecture -- a Director that plans the story and a Scene Builder that constructs individual scenes through a round-based state machine -- with dedicated Relation Subagents that populate the logical and semantic edge types of the GEST formalism that procedural generation leaves empty, making this the first approach to exercise the full expressive capacity of the representation. We evaluate in two stages: autonomous generation against procedural baselines via a 3-model LLM jury, where agentic narratives win 79% of text and 74% of video comparisons; and seeded generation where the same text is given to our system, VEO 3.1, and WAN 2.2, with human annotations showing engine-generated videos substantially outperform neural generators on physical validity (58% vs 25% and 20%) and semantic alignment (3.75/5 vs 2.33 and 1.50).

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

The paper shows LLMs can plan executable GEST graphs for a game engine to beat neural video models on physical validity and semantics in direct comparisons, but the human ratings and engine coverage need tighter checks.

The main takeaway is that shifting from pixel-level neural generation to LLM-planned event graphs executed in a 3D engine produces videos that people rate higher on physical plausibility and story fit. The 58% vs 25%/20% validity edge and 3.75 vs 2.33/1.50 alignment scores in the seeded tests against VEO 3.1 and WAN 2.2 are the concrete numbers that matter here. The authors also document that a plain staged LLM pipeline fails to produce any executable specs at all, which explains why they added the tool-constrained backend and the Relation Subagents to fill out the full GEST edges. That separation of narrative planning from constraint enforcement is the practical advance. The hierarchical Director plus Scene Builder setup looks like a reasonable way to scale the planning without losing executability. The external baselines and the LLM jury results give it some grounding beyond self-reference.

The soft spots sit in the evaluation layer. Sample sizes, exact prompting for the jury, and inter-annotator numbers are not laid out, so the strength of the human preference is hard to gauge precisely. More importantly, the engine itself is treated as a black box that faithfully runs whatever the graph specifies; there is no audit of collision handling, secondary effects, or fidelity loss in multi-actor scenes. If those artifacts are common, they would directly undercut the validity advantage without violating the syntactic guarantee.

This work is aimed at people building controllable video pipelines for simulation or robotics data, not at readers chasing the latest diffusion tricks. Anyone who needs reliable event sequences over visual flair will get usable ideas from the architecture. It is coherent enough on its own terms to warrant a serious referee, provided the revision asks for the missing protocol details and some engine coverage tests. I would send it forward with those requests rather than desk reject.

Referee Report

3 major / 2 minor

Summary. The paper introduces an agentic video generation framework that uses LLMs to construct a formal Graph of Events in Space and Time (GEST) specification of actors, actions, objects, and temporal constraints, which is then executed deterministically in a 3D game engine rather than synthesizing pixels directly. It describes a hierarchical architecture with a Director for narrative planning, a Scene Builder using a round-based state machine, and dedicated Relation Subagents to populate logical and semantic relations. The system is motivated by the failure of staged LLM refinement (0/50 executable outputs) and evaluated in two stages: autonomous generation against procedural baselines via a 3-model LLM jury (79% text and 74% video wins) and seeded generation against VEO 3.1 and WAN 2.2, where human annotations show engine outputs scoring 58% physical validity (vs. 25% and 20%) and 3.75/5 semantic alignment (vs. 2.33 and 1.50).

Significance. If the empirical comparisons hold under rigorous scrutiny, the work offers a substantive alternative to neural video generators by enforcing executability and physical constraints through symbolic planning and simulation. The separation of LLM-based narrative reasoning from programmatic constraint enforcement, along with the use of a full GEST representation, provides a falsifiable and reproducible pathway for complex scene generation that could improve reliability in applications requiring semantic and physical consistency.

major comments (3)

[Evaluation] Evaluation section (human annotation results): The central claim of substantial outperformance on physical validity (58% vs. 25%/20%) and semantic alignment (3.75/5 vs. 2.33/1.50) is load-bearing, yet the manuscript provides no sample size, inter-annotator agreement, definition of physical validity criteria, or controls for prompt engineering and annotator bias. This directly affects the strength of support for the seeded-generation comparison to VEO 3.1 and WAN 2.2.
[Method] Method section (Scene Builder and Relation Subagents): The architecture guarantees syntactic executability via tool calls and the programmatic state backend, but contains no quantitative audit or coverage analysis of simulation artifacts (e.g., collision resolution, secondary effects, or fidelity loss in multi-actor scenes) that the 3D engine may introduce when executing complex GEST specifications. Such artifacts would be invisible to the 'executable by construction' guarantee yet would undermine the reported physical-validity advantage.
[Abstract and Evaluation] Abstract and Evaluation: The staged LLM refinement baseline is reported as producing an executable specification in 0 of 50 attempts, but the manuscript does not detail the exact prompting strategy, failure modes, or how this baseline was constructed, making it difficult to assess whether the proposed hierarchical architecture's advantages are fairly isolated from prompt-engineering effects.
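For the inter-annotator agreement requested in the first major comment, one common statistic for two annotators is Cohen's kappa, which corrects raw agreement for the agreement expected by chance. The sketch below is a plain-Python illustration on invented binary "physically valid" labels. It is not the authors' protocol, and a study with more than two annotators would need a different statistic such as Fleiss' kappa.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(labels_a)
    # Observed agreement: fraction of items with identical labels.
    observed = sum(x == y for x, y in zip(labels_a, labels_b)) / n
    # Chance agreement from each annotator's marginal label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1 - expected)


# Invented labels: 1 = physically valid, 0 = not valid, for eight videos.
annotator_1 = [1, 1, 0, 0, 1, 0, 1, 0]
annotator_2 = [1, 1, 0, 1, 1, 0, 0, 0]
kappa = cohen_kappa(annotator_1, annotator_2)
print(kappa)  # 0.5
```

In this toy case the annotators agree on 6 of 8 videos (0.75), chance agreement is 0.5, and kappa is 0.5. Reporting a figure of this kind alongside the 58% vs. 25%/20% validity rates would let readers judge how stable the validity labels are.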

minor comments (2)

[Introduction] The GEST formalism is referenced throughout but lacks an early formal definition, diagram, or edge-type enumeration that would clarify how the Relation Subagents populate the representation.
[Evaluation] Figure captions and table headers could more explicitly link reported percentages to the exact comparison conditions (autonomous vs. seeded) to avoid reader confusion.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. The comments identify key areas where additional transparency and analysis will strengthen the manuscript. We address each major comment below and commit to revisions that improve the rigor of the evaluation and method sections without altering the core claims.

Referee: [Evaluation] Evaluation section (human annotation results): The central claim of substantial outperformance on physical validity (58% vs. 25%/20%) and semantic alignment (3.75/5 vs. 2.33/1.50) is load-bearing, yet the manuscript provides no sample size, inter-annotator agreement, definition of physical validity criteria, or controls for prompt engineering and annotator bias. This directly affects the strength of support for the seeded-generation comparison to VEO 3.1 and WAN 2.2.

Authors: We agree that these details are necessary to fully support the reported results. In the revised manuscript we will expand the Evaluation section with the sample size used for human annotations, inter-annotator agreement statistics, the explicit criteria employed for physical validity judgments, and a description of controls including randomized video presentation order and blinded annotation procedures. These additions will be placed in a dedicated subsection on the human study protocol. revision: yes
Referee: [Method] Method section (Scene Builder and Relation Subagents): The architecture guarantees syntactic executability via tool calls and the programmatic state backend, but contains no quantitative audit or coverage analysis of simulation artifacts (e.g., collision resolution, secondary effects, or fidelity loss in multi-actor scenes) that the 3D engine may introduce when executing complex GEST specifications. Such artifacts would be invisible to the 'executable by construction' guarantee yet would undermine the reported physical-validity advantage.

Authors: This observation is correct; executability does not automatically preclude simulation-level artifacts. We will add a quantitative audit subsection to the Method section that reports coverage statistics on collision resolutions, secondary physics effects, and any observed fidelity loss across the evaluated multi-actor scenes. This analysis will be tied directly to the human physical-validity annotations to clarify the contribution of the GEST representation versus engine behavior. revision: yes
Referee: [Abstract and Evaluation] Abstract and Evaluation: The staged LLM refinement baseline is reported as producing an executable specification in 0 of 50 attempts, but the manuscript does not detail the exact prompting strategy, failure modes, or how this baseline was constructed, making it difficult to assess whether the proposed hierarchical architecture's advantages are fairly isolated from prompt-engineering effects.

Authors: We acknowledge that greater detail on the baseline construction is required for fair comparison. The revised manuscript will expand both the Abstract and Evaluation sections to describe the precise prompting strategy used for staged refinement, a categorization of the observed failure modes across the 50 attempts, and the sampling procedure that aligned the baseline prompts with those used for the hierarchical system. This will help isolate the benefits of the Director-Scene Builder separation from prompt-engineering variations. revision: yes

Circularity Check

0 steps flagged

No significant circularity; claims rest on external empirical comparisons

The paper's load-bearing results derive from two independent evaluation stages: an LLM jury comparing agentic outputs to procedural baselines (79%/74% win rates) and human annotations on seeded generation against VEO 3.1 and WAN 2.2 (58% vs 25%/20% physical validity; 3.75 vs 2.33/1.50 semantic alignment). These metrics are measured externally and do not reduce to quantities defined inside the system. The 'executable by construction' property follows directly from the programmatic state backend and validated tool calls, which is an architectural design choice rather than a self-referential definition of the performance claims. No equations, fitted parameters, or self-citation chains are invoked to derive the reported advantages. The derivation chain is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 1 invented entity

The central claim depends on the game engine providing sufficient physical fidelity for GEST-specified events and on the LLM reliably generating valid tool calls within the enforced constraints; no free parameters or mathematical axioms are invoked.

invented entities (1)

GEST (Graph of Events in Space and Time) · no independent evidence
purpose: Formal structured specification of actors, actions, objects, and temporal constraints that the LLM constructs and the engine executes deterministically
Core representation introduced to guarantee executability and enable full expressive capacity through subagent population of edge types.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

NEWTON: Agentic Planning for Physically Grounded Video Generation
cs.CV 2026-05 unverdicted novelty 6.0

NEWTON improves physical accuracy in video generation by deploying a trainable planner that coordinates physics-aware tools and a verifier, raising joint accuracy on VideoPhy-2 without altering the base generators.

Reference graph

Works this paper leans on

20 extracted references · 20 canonical work pages · cited by 1 Pith paper · 1 internal anchor

[1] James F Allen. Maintaining knowledge about temporal intervals. Communications of the ACM, 26(11):832–843, 1983.

[2] Michael J Black, Priyanka Patel, Joachim Tesch, and Jinlong Yang. Bedlam: A synthetic dataset of bodies exhibiting detailed lifelike animated motion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8726–8737, 2023.

[3] Tim Brooks, Bill Peebles, Connor Holmes, Will DePue, Yufei Guo, Li Jing, David Schnurr, Joe Taylor, Troy Luhman, Eric Luhman, Clarence Ng, Ricky Wang, and Aditya Ramesh. Video generation models as world simulators.

[4] Nicolae Cudlenco, Mihai Masala, and Marius Leordeanu. [tiny paper] GEST-engine: Controllable multi-actor video synthesis with perfect spatiotemporal annotations. In ICLR 2026 the 2nd Workshop on World Models: Understanding, Modelling and Scaling, 2026.

[5] Hao Fei, Shengqiong Wu, Wei Ji, Hanwang Zhang, and Tat-Seng Chua. Dysen-vdm: Empowering dynamics-aware text-to-video diffusion with llms. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 7641–7653, 2024.

[6] Google. Veo 3 model card. 2025. Accessed: March 04,

[7] Panwen Hu, Jin Jiang, Jianqi Chen, Mingfei Han, Shengcai Liao, Xiaojun Chang, and Xiaodan Liang. Storyagent: Customized storytelling video generation via multi-agent collaboration. arXiv preprint arXiv:2411.04925, 2024.

[8] Jingwei Ji, Ranjay Krishna, Li Fei-Fei, and Juan Carlos Niebles. Action genome: Actions as compositions of spatio-temporal scene graphs. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10236–10247, 2020.

[9] Han Lin, Abhay Zala, Jaemin Cho, and Mohit Bansal. Videodirectorgpt: Consistent multi-scene video generation via llm-guided planning. arXiv preprint arXiv:2309.15091, 2023.

[10] Jiaxi Lv, Yi Huang, Mingfu Yan, Jiancheng Huang, Jianzhuang Liu, Yifan Liu, Yafei Wen, Xiaoxin Chen, and Shifeng Chen. Gpt4motion: Scripting physical motions in text-to-video generation via blender-oriented gpt planning. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 1430–1440, 2024.

[11] Mihai Masala and Marius Leordeanu. From vision to language through graph of events in space and time: An explainable self-supervised approach. arXiv preprint arXiv:2507.04815, 2025.

[12] Mihai Masala, Nicolae Cudlenco, Traian Rebedea, and Marius Leordeanu. Explaining vision and language through graphs of events in space and time. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 2826–2831, 2023.

[13] Joon Sung Park, Joseph O’Brien, Carrie Jun Cai, Meredith Ringel Morris, Percy Liang, and Michael S Bernstein. Generative agents: Interactive simulacra of human behavior. In Proceedings of the 36th annual acm symposium on user interface software and technology, pages 1–22, 2023.

[14] Xavier Puig, Kevin Ra, Marko Boben, Jiaman Li, Tingwu Wang, Sanja Fidler, and Antonio Torralba. Virtualhome: Simulating household activities via programs. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 8494–8502, 2018.

[15] Stephan R Richter, Vibhav Vineet, Stefan Roth, and Vladlen Koltun. Playing for data: Ground truth from computer games. In European conference on computer vision, pages 102–118. Springer, 2016.

[16] Team Wan, Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, Chen-Wei Xie, Di Chen, Feiwu Yu, Haiming Zhao, Jianxiao Yang, et al. Wan: Open and advanced large-scale video generative models. arXiv preprint arXiv:2503.20314, 2025.

[17] Qian Wang, Ziqi Huang, Ruoxi Jia, Paul Debevec, and Ning Yu. Mavis: A multi-agent framework for long-sequence video storytelling. In Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2273–2295, 2026.

[18] Zun Wang, Jialu Li, Han Lin, Jaehong Yoon, and Mohit Bansal. Dreamrunner: Fine-grained compositional story-to-video generation with retrieval-augmented motion adaptation. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 10503–10511, 2026.

[19] Qingyun Wu, Gagan Bansal, Jieyu Zhang, Yiran Wu, Beibin Li, Erkang Zhu, Li Jiang, Xiaoyun Zhang, Shaokun Zhang, Jiale Liu, et al. Autogen: Enabling next-gen llm applications via multi-agent conversations. In First conference on language modeling, 2024.

[20] Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, et al. Judging llm-as-a-judge with mt-bench and chatbot arena. Advances in neural information processing systems, 36:46595–46623, 2023.
