Co-Director: Agentic Generative Video Storytelling
Pith reviewed 2026-05-08 03:15 UTC · model grok-4.3
The pith
Co-Director models video storytelling as a global optimization problem solved by a hierarchical multi-agent system that pairs bandit-driven exploration with local refinement to maintain narrative coherence.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper claims that formalizing video storytelling as a global optimization problem, implemented through hierarchical parameterization, allows a multi-agent system to balance exploration of novel narrative strategies with exploitation of consistent configurations, thereby preventing the semantic drift and cascading failures that arise in chained prompting pipelines.
What carries the argument
Hierarchical parameterization: a global multi-armed bandit selects promising creative directions, and a local multimodal self-refinement loop corrects identity drift and enforces sequence-level consistency.
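The two-level structure can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the UCB selection rule, the flat prior of 0.5 (standing in for the LLM warm start mentioned in the paper excerpts), the refinement threshold, and the four placeholder aesthetic archetypes are all hypothetical. The axis sizes 3, 3, and 4 are chosen only because they reproduce the paper's arithmetic of 36 joint paths versus 10 independent parameters.

```python
import math
from dataclasses import dataclass, field

# Hypothetical creative axes. Strategy and narrative-mode options appear in the
# paper's rubric excerpts; the four aesthetic archetypes are placeholders.
AXES = {
    "creative_strategy": ["informational", "transformational", "comparative"],
    "narrative_mode": ["analytical", "vignette", "narrative_drama"],
    "aesthetic_archetype": ["archetype_a", "archetype_b", "archetype_c", "archetype_d"],
}


@dataclass
class FactoredBandit:
    prior: float = 0.5    # assumed stand-in for an LLM warm start
    explore: float = 0.5  # assumed UCB exploration weight
    counts: dict = field(default_factory=dict)
    values: dict = field(default_factory=dict)
    steps: int = 0

    def __post_init__(self):
        for axis, options in AXES.items():
            for opt in options:
                self.counts[(axis, opt)] = 1  # the prior counts as one pseudo-observation
                self.values[(axis, opt)] = self.prior

    def select(self):
        """Pick one option per axis by upper confidence bound."""
        choice = {}
        for axis, options in AXES.items():
            def ucb(opt, axis=axis):
                n = self.counts[(axis, opt)]
                bonus = self.explore * math.sqrt(math.log(self.steps + 2) / n)
                return self.values[(axis, opt)] + bonus
            choice[axis] = max(options, key=ucb)
        return choice

    def update(self, choice, rewards):
        """A single run updates every axis with that axis's own reward."""
        self.steps += 1
        for axis, opt in choice.items():
            key = (axis, opt)
            self.counts[key] += 1
            self.values[key] += (rewards[axis] - self.values[key]) / self.counts[key]


def refine(draft, critic, reviser, threshold=0.8, max_rounds=3):
    """Local loop: revise a draft until the critic's score clears the threshold."""
    score, feedback = critic(draft)
    for _ in range(max_rounds):
        if score >= threshold:
            break
        draft = reviser(draft, feedback)
        score, feedback = critic(draft)
    return draft, score


def run(bandit, generate, critic, reviser, score_axes, budget=4):
    """Global loop: the bandit proposes a direction, the local loop polishes it."""
    best = None
    for _ in range(budget):
        choice = bandit.select()
        draft, quality = refine(generate(choice), critic, reviser)
        bandit.update(choice, score_axes(draft, choice))
        if best is None or quality > best[1]:
            best = (draft, quality, choice)
    return best
```

Here `generate`, `critic`, `reviser`, and `score_axes` are caller-supplied callables standing in for the generation and evaluation agents. Keeping one value estimate per axis option, rather than one per joint configuration, is what lets a single run inform all three axes at once.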
If this is right
- Co-Director outperforms state-of-the-art baselines on the 400-scenario GenAD-Bench dataset.
- The framework balances exploration of new narrative directions with exploitation of consistent creative choices.
- It supplies a principled alternative to independent handcrafted prompting in agentic video pipelines.
- The method is presented as generalizing beyond advertising to broader cinematic narratives.
Where Pith is reading between the lines
- The same bandit-plus-refinement structure could be tested on longer-form or interactive video generation tasks where drift accumulates over dozens of shots.
- If the optimization remains stable, the approach may reduce reliance on human prompt engineering for multi-shot video projects.
- The dual global-local design suggests a template for other generative domains that suffer from coherence loss when modules are chained, such as sequential image or audio synthesis.
Load-bearing premise
The combination of global multi-armed bandit exploration and local multimodal self-refinement will reliably prevent semantic drift and cascading failures across diverse video sequences without introducing new inconsistencies or requiring extensive per-scenario tuning.
What would settle it
Long video sequences generated by Co-Director that exhibit increasing identity drift or coherence breakdowns on scenarios outside the GenAD-Bench advertising set would falsify the claim that the hierarchical optimization prevents such failures.
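One way to make "increasing identity drift" measurable is to fit a trend line to per-shot identity scores. The sketch below assumes a hypothetical list of similarity scores in [0, 1] (one per shot, each comparing the shot's character or product against its reference image) and an arbitrary tolerance; neither the score nor the threshold comes from the paper.

```python
def drift_slope(similarities):
    """Least-squares slope of identity similarity against shot index."""
    n = len(similarities)
    if n < 2:
        raise ValueError("need at least two shots")
    mean_x = (n - 1) / 2
    mean_y = sum(similarities) / n
    cov = sum((i - mean_x) * (s - mean_y) for i, s in enumerate(similarities))
    var = sum((i - mean_x) ** 2 for i in range(n))
    return cov / var


def shows_increasing_drift(similarities, tolerance=-0.01):
    """True when similarity falls faster than the (assumed) tolerance per shot."""
    return drift_slope(similarities) < tolerance
```

A clearly negative slope on long sequences from outside the advertising set would be the kind of evidence this paragraph describes; a flat slope would be consistent with the paper's claim.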
Original abstract
While diffusion models generate high-fidelity video clips, transforming them into coherent storytelling engines remains challenging. Current agentic pipelines automate this via chained modules but suffer from semantic drift and cascading failures due to independent, handcrafted prompting. We present Co-Director, a hierarchical multi-agent framework formalizing video storytelling as a global optimization problem. To ensure semantic coherence, we introduce hierarchical parameterization: a multi-armed bandit globally identifies promising creative directions, while a local multimodal self-refinement loop mitigates identity drift and ensures sequence-level consistency. This balances the exploration of novel narrative strategies with the exploitation of effective creative configurations. For evaluation, we introduce GenAD-Bench, a 400-scenario dataset of fictional products for personalized advertising. Experiments demonstrate that Co-Director significantly outperforms state-of-the-art baselines, offering a principled approach that seamlessly generalizes to broader cinematic narratives. Project Page: https://co-director-agent.github.io/
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes Co-Director, a hierarchical multi-agent framework that treats video storytelling as a global optimization problem. A multi-armed bandit explores creative directions at the global level while a local multimodal self-refinement loop mitigates semantic drift and identity inconsistencies in chained diffusion-based video generation. The authors introduce GenAD-Bench, a 400-scenario dataset focused on fictional-product personalized advertising, and claim that Co-Director significantly outperforms state-of-the-art baselines while offering a principled approach that generalizes to broader cinematic narratives.
Significance. If the empirical results hold, the work provides a concrete architecture for addressing coherence failures in agentic video pipelines, which is a recognized bottleneck in long-form generation. The GenAD-Bench benchmark fills a gap for evaluating personalized advertising scenarios. The hierarchical parameterization (global exploration via MAB combined with local self-refinement) is a plausible mechanism for balancing novelty and consistency, though its robustness beyond the advertising domain remains untested.
major comments (2)
- [Abstract] The claim that Co-Director 'significantly outperforms state-of-the-art baselines' is asserted without any quantitative metrics, baseline names, statistical significance tests, or error bars, preventing assessment of whether the results actually support the central performance claim.
- [Experiments] Experiments section: All reported evaluations are confined to GenAD-Bench (400 fictional-product advertising scenarios). No ablations on narrative length, character-driven stories, or transfer experiments to non-advertising cinematic tasks are provided, so the assertion of seamless generalization to 'broader cinematic narratives' rests on untested extrapolation rather than demonstrated robustness.
minor comments (1)
- [Method] The description of the multi-armed bandit arms and the precise objective of the local self-refinement loop would benefit from explicit pseudocode or algorithmic details to support reproducibility.
Simulated Author's Rebuttal
We thank the referee for the constructive comments, which highlight important aspects of clarity and scope. We address each major comment below and indicate the planned revisions.
Point-by-point responses
-
Referee: [Abstract] The claim that Co-Director 'significantly outperforms state-of-the-art baselines' is asserted without any quantitative metrics, baseline names, statistical significance tests, or error bars, preventing assessment of whether the results actually support the central performance claim.
Authors: We agree that the abstract would be strengthened by greater specificity. The detailed results, including metrics, baseline names, error bars, and statistical tests, appear in the Experiments section. In the revised version, we will update the abstract to name the primary baselines and include a concise summary of key quantitative gains (e.g., relative improvements in coherence and consistency scores), while retaining the full tables and analysis in the body.
Revision: yes
-
Referee: [Experiments] Experiments section: All reported evaluations are confined to GenAD-Bench (400 fictional-product advertising scenarios). No ablations on narrative length, character-driven stories, or transfer experiments to non-advertising cinematic tasks are provided, so the assertion of seamless generalization to 'broader cinematic narratives' rests on untested extrapolation rather than demonstrated robustness.
Authors: We acknowledge that evaluations are focused on GenAD-Bench and that direct transfer experiments to other domains are absent. The benchmark was designed to target a recognized gap in personalized advertising narratives. In revision, we will add a Limitations subsection that explicitly discusses the current scope, qualifies the generalization language to reflect demonstrated results on the hierarchical framework, and incorporates additional ablations on narrative complexity within the advertising domain. Full cross-domain transfer studies require new datasets and are identified as future work.
Revision: partial
Circularity Check
No significant circularity in the architectural framework
Full rationale
The paper presents a new hierarchical multi-agent system (Co-Director) for video storytelling as a global optimization problem solved via multi-armed bandit exploration and local multimodal self-refinement. No mathematical derivations, closed-form equations, or parameter-fitting procedures are described that could reduce predictions or results to their own inputs by construction. The contribution is an empirical systems architecture evaluated on the introduced GenAD-Bench dataset; claims of outperformance and generalization rest on experimental results rather than any self-referential derivation chain or self-citation load-bearing premise. This is a standard non-circular systems paper.
Axiom & Free-Parameter Ledger
invented entities (1)
-
hierarchical parameterization
no independent evidence
Reference graph
Works this paper leans on
-
[1]
Brand & Product Synthesis: The LLM generated 50 unique brands and assigned exactly 4 distinct products to each. Products were explicitly prompted to span a wide range of form factors, including small consumer goods, large industrial equipment, and intangible offerings (e.g., software and subscription services).
-
[2]
Paired Scenario Generation: For each of the 200 products, the LLM generated two contrasting ...
Table 5 | Evaluation on ViStoryBench-Lite. Columns: Method; Style Consistency (Cross, Self); Character Consistency (Cross, Self); Quality & Diversity (Aesthetics, Inception); OCCM; Prompt Alignment (Scene, Camera, GCA, LCA, Avg).
MMStoryAgent: 0.261 0.661 0.385 0.598 5.910 8.090 58.236 2.965 2.452 1.6...
-
[3]
Visual Asset Generation: The text-to-image model generated a distinct vector-style logo for each of the 50 brands. Subsequently, it generated product reference images conditioned on both the textual product description and the respective brand logo to ensure visual-semantic integration.
-
[4]
Generate a video advertisement. [SIX-POINT PROMPT]. Reserve the last 1 second for an ending scene highlighting the provided brand logo image
Manual Verification & Quality Control: To ensure high-fidelity inputs for the product reference generation stage, we first manually verified and corrected all brand logos. Following the generation of these product reference visuals, approximately 15% were regenerated to address improper logo integration, logical inconsistencies, or unintended visual memo...
2024
-
[5]
Factored Rewards: By utilizing a factored reward signal $R = (r_{cs}, r_{nm}, r_{aa})$, a single execution step $t$ simultaneously updates the expected values across all three independent creative axes. The system is not searching 36 distinct paths, but rather evaluating 10 independent parameters.
-
[6]
functional utility
LLM-Driven Warm Start: We initialize the expected values using an LLM, safely biasing early exploration toward theoretically sound configurations and pruning the vast majority of sub-optimal combinations before iteration even begins. Consequently, as demonstrated in our empirical results (Fig. 2), an iteration budget of $T = 4$ is sufficient for the MAB to rapi...
2025
-
[7]
**Hook Quality (0-20 pts):** Does the story immediately grab attention within the first 1-2 seconds? Is there an element of curiosity, emotion, or visual intrigue that prevents the viewer from scrolling? A weak, generic opening gets a low score
-
[8]
**Narrative Arc & Cohesion (0-20 pts):** Does the storyline present a clear, simple, and complete narrative (e.g., setup, confrontation, resolution)? Do the scenes flow logically, or does the progression feel disjointed or confusing? The story must be fully understa...
-
[9]
**Product Integration (0-20 pts):** Is the product woven into the narrative in a way that feels natural and essential? Does the product help resolve the core conflict or enhance the emotional peak of the story? A storyline where the product feels "tacked on" or irrelevant gets a low score
-
[10]
**Engagement & Emotional Resonance (0-20 pts):** Does the story evoke a specific, desired emotion (e.g., joy, humor, inspiration, relief)? Is the core concept interesting and memorable? Does it create a positive association with the brand and product?
-
[11]
good " or
**Prompt Adherence (0-20 pts):** How well does the storyline capture the key elements of the original user prompt, including the product, target audience, and core message? Does it align with the requested tone and affinity group? ## TONE & BEHAVIOR GUIDELINES * **Be Decisive:** Your f...
-
[12]
**Visual Consistency & Cohesion (0-20 pts):** * **Character Reference Match (CRITICAL):** Do the characters in the generated scenes actually look like the provided **Character Reference Image**? Check for facial structure, ethnicity, and general vibe. * **Product Refe...
-
[13]
**Narrative Flow & Clarity (0-20 pts):** * Do the four images tell a clear, sequential story? Is there a logical progression from one image to the next? * Could a viewer understand the basic story arc (beginning, middle, end) just by looking at these four frames without any other context?
-
[14]
**Product Appeal & Integration (0-20 pts):** * How is the product presented? Does it look appealing and desirable? * Is the product the "hero" of the story? Is its role clear and essential to the narrative, or does it feel incidental?
-
[15]
**Engagement & Emotional Impact (0-20 pts):** * Are the images visually compelling? Are the composition, colors, and subject matter interesting? * As a set, do the images create a specific mood or evoke an emotional response that is relevant to the product and target audience?
-
[16]
breakdown
**Prompt Adherence (0-20 pts):** * **Demographic & Situational Resonance:** Do the physical settings and environmental details correctly reflect the geographical and situational context of the target audience (e.g., if demographics say Alaska, are...
-
[17]
**Coherence (0-20 points):** How well does the story flow? Is the narrative clear and easy to follow?
-
[18]
**Visual Quality (0-20 points):** How good are the aesthetics? Are the images/clips high-quality, visually appealing, and consistent in style?
-
[19]
**Engagement (0-20 points):** How captivating is the video? Does it grab your attention and make you want to keep watching?
-
[20]
**Prompt Adherence (0-20 pts):** * **Demographic & Situational Resonance:** Do the physical settings and environmental details correctly reflect the geographical and situational context of the target audience? * **Key Message:** How well do the images fulfill t...
-
[21]
**Logical & Physical Consistency (0-20 pts):** Check the video for violations of real-world intuition and physical laws. * **Newtonian Physics & Kinematics:** Does motion adhere to gravity, momentum, and inertia? (e.g., check for "floating" footsteps, unnatural acceleration...
-
[22]
**Creative Strategy Efficacy (0-100):** Based on **Laskey, Day, and Crask’s (1989)** typology of creative strategies: * Did the chosen strategy (Informational, Transformational, or Comparative) successfully communicate the value proposition? * If **Inform...
1989
-
[23]
**Narrative Mode Efficacy (0-100):** Based on **Escalas’ (2004)** theory of narrative processing and **Green & Brock’s (2000)** concept of "narrative transportation": * Did the structure (Analytical, Vignette, or Narrative Drama) effectively transport the v...
2004
-
[24]
unnatural
**Aesthetic Archetype Efficacy (0-100):** Based on **Zettl’s (2016)** Applied Media Aesthetics and **Lang’s (2000)** limited capacity model of message processing: * Did the visual and auditory choices (lighting, motion, audio) align with the b...
2016
-
[25]
**`feedback`**: A sharp, critical review citing specific timestamps or frames where errors occur
-
[26]
**`primary_fault`**: The stage most responsible for the errors (`'storyline'`, `'image'`, or `'video'`)
-
[27]
breakdown
**`actionable_feedback`**: A direct instruction to fix the specific error. Write this as a command to the AI agent responsible for that stage. ## OUTPUT FORMAT You **MUST** format your response as a single JSON object. Do not include any markdown formatting (like ‘‘‘json) or conver...
-
[28]
[Image 1]: The reference Brand Logo
-
[29]
[Image 2]: The reference Product Image
-
[30]
[Text Constraints]: A six-point prompt defining the Brand, Product, Target Gender, Target Age, Target Location, and Target Interest
-
[31]
[Generated Video]: The final AI-generated video advertisement. You must evaluate the video across four distinct dimensions and output a score from 0 to 100 for each. ### EVALUATION RUBRIC **1. Visual Asset Fidelity (VAF)** Measures the...
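The critic prompt excerpts above name three output fields (`feedback`, `primary_fault`, `actionable_feedback`) and three stages (`storyline`, `image`, `video`). A minimal sketch of how such a JSON critique might be validated and routed back to the responsible stage follows. The function names, the handler mapping, and the tolerance for a stray markdown fence are assumptions for illustration; the excerpts are truncated, so the real output may carry further fields (such as scores) that are not modeled here.

```python
import json

STAGES = ("storyline", "image", "video")
REQUIRED = ("feedback", "primary_fault", "actionable_feedback")
FENCE = "`" * 3  # a markdown code fence, built this way to keep the example readable


def parse_critique(raw):
    """Parse a critic response into a dict, checking the three named fields."""
    text = raw.strip()
    # The prompt forbids markdown fences, but models sometimes add them anyway.
    if text.startswith(FENCE):
        text = text.strip("`")
        if text.startswith("json"):
            text = text[len("json"):]
    data = json.loads(text)
    missing = [key for key in REQUIRED if key not in data]
    if missing:
        raise ValueError(f"critique is missing fields: {missing}")
    if data["primary_fault"] not in STAGES:
        raise ValueError(f"unknown stage: {data['primary_fault']!r}")
    return data


def route(critique, handlers):
    """Send the actionable feedback to the handler for the faulty stage."""
    return handlers[critique["primary_fault"]](critique["actionable_feedback"])
```

For example, `route(parse_critique(response), {"storyline": fix_story, "image": fix_image, "video": fix_video})` would call only the handler for the stage the critic blamed, where the three `fix_*` callables are hypothetical stage agents.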