arxiv: 2604.24842 · v1 · submitted 2026-04-27 · 💻 cs.AI · cs.MA · cs.MM

Co-Director: Agentic Generative Video Storytelling

Yale Song, Yiwen Song, Nick Losier, Nathan Hodson, Ye Jin, Rhyard Zhu, Yan Xu, Daniel Vlasic, Carina Claassen, Jasmine Leon, Khanh G. LeViet, Zack Chomyn, Joe Timmons, Brett Slatkin, Scott Penberthy, Tomas Pfister


Pith reviewed 2026-05-08 03:15 UTC · model grok-4.3

classification 💻 cs.AI · cs.MA · cs.MM

keywords agentic video generation, multi-agent framework, semantic coherence, video storytelling, diffusion models, multi-armed bandit, GenAD-Bench, generative narrative


The pith

Co-Director models video storytelling as a global optimization problem solved by a hierarchical multi-agent system that pairs bandit-driven exploration with local refinement to maintain narrative coherence.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Diffusion models can now produce high-quality individual video clips, yet chaining them into longer stories has consistently produced semantic drift and cascading errors from independent handcrafted prompts. Co-Director reframes the entire task as a single optimization process instead of sequential modules. A global multi-armed bandit searches for promising creative directions across the full narrative while local multimodal agents refine each segment to preserve identity and sequence consistency. The approach is evaluated on a new 400-scenario benchmark of fictional product advertisements and reports stronger performance than prior agentic pipelines. If the method holds, it supplies a scalable route from isolated clip generation to reliable multi-shot cinematic output.

Core claim

The paper claims that formalizing video storytelling as a global optimization problem, implemented through hierarchical parameterization, allows a multi-agent system to balance exploration of novel narrative strategies with exploitation of consistent configurations, thereby preventing the semantic drift and cascading failures that arise in chained prompting pipelines.

What carries the argument

Hierarchical parameterization: a global multi-armed bandit selects promising creative directions, and a local multimodal self-refinement loop corrects identity drift and enforces sequence-level consistency.
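The global half of this design can be illustrated with a minimal sketch. It assumes UCB1 as the selection rule and a scalar reward per generated video; this page does not say which bandit rule or reward the paper uses, and the arm names and `reward_fn` are hypothetical placeholders.

```python
import math


class UCB1Bandit:
    """Toy UCB1 bandit over a fixed set of creative directions (assumed rule)."""

    def __init__(self, arms):
        self.arms = list(arms)
        self.counts = {arm: 0 for arm in self.arms}
        self.means = {arm: 0.0 for arm in self.arms}

    def select(self):
        # Try every arm once before applying the confidence bound.
        for arm in self.arms:
            if self.counts[arm] == 0:
                return arm
        total = sum(self.counts.values())
        # Exploitation term (mean) plus exploration bonus (confidence radius).
        return max(
            self.arms,
            key=lambda a: self.means[a]
            + math.sqrt(2.0 * math.log(total) / self.counts[a]),
        )

    def update(self, arm, reward):
        # Incremental running mean of observed rewards for this arm.
        self.counts[arm] += 1
        self.means[arm] += (reward - self.means[arm]) / self.counts[arm]


def search_directions(bandit, reward_fn, steps):
    """Run the bandit for `steps` rounds and return the best arm by mean reward."""
    for _ in range(steps):
        arm = bandit.select()
        bandit.update(arm, reward_fn(arm))
    return max(bandit.arms, key=lambda a: bandit.means[a])
```

In a real pipeline `reward_fn` would wrap the expensive step (generate a video for the chosen direction, then score it), so the iteration budget `steps` is the main cost knob.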

If this is right

Co-Director outperforms state-of-the-art baselines on the 400-scenario GenAD-Bench dataset.
The framework balances exploration of new narrative directions with exploitation of consistent creative choices.
It supplies a principled alternative to independent handcrafted prompting in agentic video pipelines.
The method is presented as generalizing beyond advertising to broader cinematic narratives.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same bandit-plus-refinement structure could be tested on longer-form or interactive video generation tasks where drift accumulates over dozens of shots.
If the optimization remains stable, the approach may reduce reliance on human prompt engineering for multi-shot video projects.
The dual global-local design suggests a template for other generative domains that suffer from coherence loss when modules are chained, such as sequential image or audio synthesis.

Load-bearing premise

The combination of global multi-armed bandit exploration and local multimodal self-refinement will reliably prevent semantic drift and cascading failures across diverse video sequences without introducing new inconsistencies or requiring extensive per-scenario tuning.

What would settle it

Long video sequences generated by Co-Director that exhibit increasing identity drift or coherence breakdowns on scenarios outside the GenAD-Bench advertising set would falsify the claim that the hierarchical optimization prevents such failures.
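One way such a test could be operationalized is sketched below, under assumptions that are not from the paper: each shot is summarized by an embedding vector of the main character or product, drift is measured as cosine distance to a reference embedding, and "increasing drift" is read as a positive least-squares slope over shot index. The function names are hypothetical.

```python
import math


def cosine_distance(u, v):
    """Return 1 - cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def drift_slope(reference, shot_embeddings):
    """Least-squares slope of distance-to-reference against shot index.

    A clearly positive slope means later shots sit further from the
    reference identity than earlier ones.
    """
    if len(shot_embeddings) < 2:
        raise ValueError("need at least two shots to estimate a trend")
    distances = [cosine_distance(reference, e) for e in shot_embeddings]
    n = len(distances)
    mean_x = (n - 1) / 2.0
    mean_y = sum(distances) / n
    numerator = sum((i - mean_x) * (d - mean_y) for i, d in enumerate(distances))
    denominator = sum((i - mean_x) ** 2 for i in range(n))
    return numerator / denominator
```

A slope near zero on long, non-advertising sequences would be consistent with the paper's claim; a slope that grows with sequence length would count against it.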


While diffusion models generate high-fidelity video clips, transforming them into coherent storytelling engines remains challenging. Current agentic pipelines automate this via chained modules but suffer from semantic drift and cascading failures due to independent, handcrafted prompting. We present Co-Director, a hierarchical multi-agent framework formalizing video storytelling as a global optimization problem. To ensure semantic coherence, we introduce hierarchical parameterization: a multi-armed bandit globally identifies promising creative directions, while a local multimodal self-refinement loop mitigates identity drift and ensures sequence-level consistency. This balances the exploration of novel narrative strategies with the exploitation of effective creative configurations. For evaluation, we introduce GenAD-Bench, a 400-scenario dataset of fictional products for personalized advertising. Experiments demonstrate that Co-Director significantly outperforms state-of-the-art baselines, offering a principled approach that seamlessly generalizes to broader cinematic narratives. Project Page: https://co-director-agent.github.io/

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

Co-Director gives a hierarchical bandit and refinement approach to agentic video storytelling, but the abstract's performance claims lack supporting details and the generalization is untested.


Colleague,

The main thing to know about this paper is that it presents Co-Director, a hierarchical multi-agent system for turning diffusion-based video clips into coherent stories. It uses a global multi-armed bandit to pick promising narrative strategies and pairs that with a local self-refinement loop to maintain consistency and avoid identity drift.

The paper does a good job laying out the issues with existing chained agent pipelines, like semantic drift from handcrafted prompts. Formalizing storytelling as a global optimization problem is a clean way to think about it, and the introduction of GenAD-Bench, with its 400 fictional product scenarios for personalized ads, gives a focused testbed that could be handy for others in the field.

Where it gets soft is in the results. The abstract says it significantly outperforms baselines, but there are no metrics, no list of what those baselines are, no error bars or tests mentioned. Without that, it's difficult to gauge how much the hierarchical setup actually helps. The claim about seamless generalization to broader cinematic narratives also sits on thin ground because everything is evaluated on short advertising sequences. No transfer tests or longer narrative experiments are referenced, so that part reads as an extrapolation rather than a demonstrated strength.

This work is for researchers building agentic pipelines for generative video, especially those focused on practical applications in entertainment or marketing. Someone looking for architectural ideas to reduce cascading failures in video generation would get value from the description of the bandit and refinement components.

It should go to peer review. The idea is distinct from prior work and the problem matters, but the referees will have to dig into the methods and results to see if the performance claims check out and whether the system really scales beyond the ad domain. If the full paper includes reproducible details and ablations, it could be a useful reference.

Referee Report

2 major / 1 minor

Summary. The paper proposes Co-Director, a hierarchical multi-agent framework that treats video storytelling as a global optimization problem. A multi-armed bandit explores creative directions at the global level while a local multimodal self-refinement loop mitigates semantic drift and identity inconsistencies in chained diffusion-based video generation. The authors introduce GenAD-Bench, a 400-scenario dataset focused on fictional-product personalized advertising, and claim that Co-Director significantly outperforms state-of-the-art baselines while offering a principled approach that generalizes to broader cinematic narratives.

Significance. If the empirical results hold, the work provides a concrete architecture for addressing coherence failures in agentic video pipelines, which is a recognized bottleneck in long-form generation. The GenAD-Bench benchmark fills a gap for evaluating personalized advertising scenarios. The hierarchical parameterization (global exploration via MAB combined with local self-refinement) is a plausible mechanism for balancing novelty and consistency, though its robustness beyond the advertising domain remains untested.

major comments (2)

[Abstract] The claim that Co-Director 'significantly outperforms state-of-the-art baselines' is asserted without any quantitative metrics, baseline names, statistical significance tests, or error bars, preventing assessment of whether the results actually support the central performance claim.
[Experiments] All reported evaluations are confined to GenAD-Bench (400 fictional-product advertising scenarios). No ablations on narrative length, character-driven stories, or transfer experiments to non-advertising cinematic tasks are provided, so the assertion of seamless generalization to 'broader cinematic narratives' rests on untested extrapolation rather than demonstrated robustness.
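The kind of evidence the first major comment asks for can be sketched as a paired bootstrap over per-scenario scores. This is an illustration of one common procedure, not the paper's evaluation protocol; the function name, the resample count, and the use of a percentile interval are all assumptions.

```python
import random


def paired_bootstrap_ci(scores_a, scores_b, n_resamples=2000, alpha=0.05, seed=0):
    """Mean paired difference (A - B) with a percentile bootstrap interval.

    scores_a and scores_b hold one score per scenario, in the same order,
    so each difference compares both systems on the same scenario.
    """
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("need two non-empty score lists of equal length")
    rng = random.Random(seed)  # fixed seed keeps the interval reproducible
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    resampled_means = []
    for _ in range(n_resamples):
        # Resample scenarios with replacement and record the mean difference.
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        resampled_means.append(sum(sample) / n)
    resampled_means.sort()
    lower = resampled_means[int((alpha / 2) * n_resamples)]
    upper = resampled_means[int((1 - alpha / 2) * n_resamples) - 1]
    return sum(diffs) / n, lower, upper
```

An interval that excludes zero supports a claim of improvement on the benchmark; reporting the interval alongside named baselines would address the comment directly.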

minor comments (1)

[Method] The description of the multi-armed bandit arms and the precise objective of the local self-refinement loop would benefit from explicit pseudocode or algorithmic details to support reproducibility.
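For orientation, a generic generate-critique-revise loop of the kind this comment refers to might look like the sketch below. It is not the authors' algorithm: the `generate` and `critique` callables, the score threshold, and the round budget are hypothetical stand-ins for whatever the paper's agents and stopping rule actually are.

```python
def self_refine(generate, critique, threshold, max_rounds):
    """Iteratively regenerate an output until the critic's score is high enough.

    generate(feedback) -> output; feedback is None on the first round.
    critique(output)   -> (score, feedback) from a multimodal evaluator.
    Returns (best_output, best_score, rounds_used).
    """
    if max_rounds < 1:
        raise ValueError("max_rounds must be at least 1")
    feedback = None
    best_output, best_score = None, float("-inf")
    rounds_used = 0
    for _ in range(max_rounds):
        rounds_used += 1
        output = generate(feedback)
        score, feedback = critique(output)
        # Keep the best attempt so a bad late revision cannot make things worse.
        if score > best_score:
            best_output, best_score = output, score
        if score >= threshold:
            break
    return best_output, best_score, rounds_used
```

Stating the real counterparts of `threshold` and `max_rounds`, and what the critic is asked to check, is the level of detail the comment requests.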

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments, which highlight important aspects of clarity and scope. We address each major comment below and indicate the planned revisions.


Referee: [Abstract] The claim that Co-Director 'significantly outperforms state-of-the-art baselines' is asserted without any quantitative metrics, baseline names, statistical significance tests, or error bars, preventing assessment of whether the results actually support the central performance claim.

Authors: We agree that the abstract would be strengthened by greater specificity. The detailed results, including metrics, baseline names, error bars, and statistical tests, appear in the Experiments section. In the revised version, we will update the abstract to name the primary baselines and include a concise summary of key quantitative gains (e.g., relative improvements in coherence and consistency scores), while retaining the full tables and analysis in the body.
revision: yes

Referee: [Experiments] All reported evaluations are confined to GenAD-Bench (400 fictional-product advertising scenarios). No ablations on narrative length, character-driven stories, or transfer experiments to non-advertising cinematic tasks are provided, so the assertion of seamless generalization to 'broader cinematic narratives' rests on untested extrapolation rather than demonstrated robustness.

Authors: We acknowledge that evaluations are focused on GenAD-Bench and that direct transfer experiments to other domains are absent. The benchmark was designed to target a recognized gap in personalized advertising narratives. In revision, we will add a Limitations subsection that explicitly discusses the current scope, qualifies the generalization language to reflect demonstrated results on the hierarchical framework, and incorporates additional ablations on narrative complexity within the advertising domain. Full cross-domain transfer studies require new datasets and are identified as future work.
revision: partial

Circularity Check

0 steps flagged

No significant circularity in the architectural framework


The paper presents Co-Director, a new hierarchical multi-agent system that treats video storytelling as a global optimization problem, solved via multi-armed bandit exploration and local multimodal self-refinement. It describes no mathematical derivations, closed-form equations, or parameter-fitting procedures that could reduce predictions or results to their own inputs by construction. The contribution is an empirical systems architecture evaluated on the introduced GenAD-Bench dataset; claims of outperformance and generalization rest on experimental results rather than on a self-referential derivation chain or a load-bearing self-citation. This is a standard non-circular systems paper.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 1 invented entity

Abstract supplies no explicit free parameters, mathematical axioms, or independently evidenced entities; the framework components are introduced as novel design choices.

invented entities (1)

hierarchical parameterization (no independent evidence)
purpose: to balance global narrative exploration with local consistency enforcement
note: core organizing idea of the Co-Director framework, introduced in the abstract.

pith-pipeline@v0.9.0 · 5510 in / 1129 out tokens · 48610 ms · 2026-05-08T03:15:41.537734+00:00 · methodology

