Bridging Creative Intent and Visual Quality: Creator-Driven Recurrent Video Generation with Agentic Feedback Loops

Aiden Lei; Alexander Liu; Denis Savytski; Heding Liu; Sihan Liang; Warren Yang; Zhe Zhao

arxiv: 2606.18591 · v1 · pith:U43L7Q7Mnew · submitted 2026-06-17 · 💻 cs.CV

Denis Savytski, Aiden Lei, Heding Liu, Warren Yang, Sihan Liang, Alexander Liu, Zhe Zhao

Pith reviewed 2026-06-26 21:59 UTC · model grok-4.3

classification 💻 cs.CV

keywords video generation · human-AI co-creation · feedback loops · multimodal LLMs · iterative refinement · narrative coherence · creative direction

The pith

CHIEF lets creators drive iterative AI video generation while persona-conditioned multimodal LLMs supply audience-style critiques that self-evaluation cannot provide.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces CHIEF, a framework that keeps the human creator in charge of each refinement step for AI-generated videos while supplying automatic subjective feedback. Creators steer creative direction; a refiner agent folds their changes into the next version. Persona-conditioned multimodal LLMs watch the output and issue audience-perspective critiques on plot, scenes, and narrative flow. The system was tested by having students with no filmmaking background produce videos ranging from one minute to a full ten-minute short film. The central goal is to restore narrative coherence and creative intent that current video generators lose at longer durations.

Core claim

CHIEF enables recurrent video generation in which the creator directs each iteration and persona-conditioned multimodal LLMs supply subjective critiques from audience perspectives, yielding better narrative coherence and creative direction than self-evaluation by the generation model alone.

What carries the argument

The CHIEF framework of creator-driven iterations supported by a specialized refiner agent and persona-conditioned multimodal LLM feedback loops.
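
The loop described here can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `run_loop`, the `Iteration` record, and every callable parameter (`generate`, `personas`, `translate`, `creator_gate`, `refine`) are hypothetical stand-ins for the video generator, the Feedback Agents, the Feedback Translator, the creator, and the Refiner.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class Iteration:
    """One pass of the loop: the prompts used and the issues the creator approved."""
    prompts: List[str]
    issues: List[str] = field(default_factory=list)


def run_loop(
    prompts: List[str],
    generate: Callable[[List[str]], str],            # hypothetical video generator
    personas: List[Callable[[str], str]],            # hypothetical persona critics
    translate: Callable[[List[str]], List[str]],     # critiques -> actionable issues
    creator_gate: Callable[[List[str]], List[str]],  # creator keeps or drops issues
    refine: Callable[[List[str], List[str]], List[str]],  # hypothetical refiner
    max_iters: int = 3,
) -> Tuple[List[str], List[Iteration]]:
    """Run creator-gated refinement until no issue is approved or max_iters is hit."""
    history: List[Iteration] = []
    for _ in range(max_iters):
        video = generate(prompts)
        critiques = [persona(video) for persona in personas]
        issues = translate(critiques)
        approved = creator_gate(issues)  # the creator stays in charge of direction
        history.append(Iteration(list(prompts), list(approved)))
        if not approved:
            break  # nothing left that the creator wants changed
        prompts = refine(prompts, approved)
    return prompts, history
```

The creator gate sits between feedback and refinement on purpose in this sketch: automated critiques only become prompt changes once the creator accepts them, which mirrors the separation of creative direction from automated feedback that the framework emphasizes.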

If this is right

Creators can steer long-form video output through repeated human-in-the-loop cycles without needing filmmaking expertise.
Feedback generated from audience-simulating personas surfaces plot and scene problems that internal model evaluation overlooks.
The method supports production of complete short films with complicated plots up to ten minutes long.
Each iteration incorporates explicit creator revisions through the refiner agent rather than relying solely on automatic improvement.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same persona-feedback structure could be adapted to other generative tasks such as script writing or interactive story branching.
Results may vary with the choice and diversity of personas used to condition the feedback LLMs.
Adding the feedback step earlier in the generation pipeline rather than only after full video output could reduce total iteration count.

Load-bearing premise

Persona-conditioned multimodal LLMs can reliably produce subjective audience critiques that improve narrative coherence and creative direction beyond what the video model can achieve by self-evaluation.

What would settle it

Run a controlled comparison in which the same starting prompts generate videos refined only by model self-evaluation versus videos refined with CHIEF's persona-LLM feedback, then have independent viewers rate narrative coherence and creative fit.
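
One way to analyze such a comparison, assuming each independent viewer rates both versions of the same video on a numeric scale, is a paired sign-flip permutation test. The sketch below is illustrative only: the function name, the rating scale, and the resample count are assumptions, and the paper does not specify an analysis.

```python
import random
from statistics import mean
from typing import Sequence, Tuple


def paired_permutation_test(
    chief: Sequence[float],
    self_eval: Sequence[float],
    n_resamples: int = 10_000,
    seed: int = 0,
) -> Tuple[float, float]:
    """Return (mean paired difference, two-sided p-value).

    chief[i] and self_eval[i] are ratings of the two versions of item i.
    Under the null hypothesis the sign of each paired difference is arbitrary,
    so we randomly flip signs and count how often the resampled mean is at
    least as extreme as the observed one.
    """
    if len(chief) != len(self_eval) or not chief:
        raise ValueError("need two non-empty rating lists of equal length")
    diffs = [a - b for a, b in zip(chief, self_eval)]
    observed = mean(diffs)
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_resamples):
        flipped = [d if rng.random() < 0.5 else -d for d in diffs]
        if abs(mean(flipped)) >= abs(observed) - 1e-12:
            hits += 1
    # Add-one correction keeps the estimated p-value strictly positive.
    return observed, (hits + 1) / (n_resamples + 1)
```

A paired design is assumed here because both refinement conditions start from the same prompt, so differences between prompts should not be counted as differences between methods.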

Figures

Figures reproduced from arXiv: 2606.18591 by Aiden Lei, Alexander Liu, Denis Savytski, Heding Liu, Sihan Liang, Warren Yang, Zhe Zhao.

**Figure 1.** Overview of CHIEF. The creator provides the initial script and remains responsible for creative direction, while the system translates the script into keyframes, clips, and music prompts. Generated artifacts are evaluated by Feedback Agents representing the audience. The Feedback Translator turns this feedback into actionable issues, and the Refiner updates prompts for the next iteration. This loop allows …

**Figure 2.** Representative keyframe improvements for Autonomous Refinement with Creator Monitoring. Each row shows a selected keyframe from two example videos lasting 1 minute, with omitted intermediate iterations indicated by ellipses. The left block presents the Interview example of a man commuting to an interview, where refinement increases rush-hour crowd density, removes an unintended flashlight artifact, and rem…

**Figure 3.** Keyframe progression across iterations of the Creator-Driven Film Generation Case Study. The figure shows three keyframes (Core, Drone, and Chipping) across the baseline and three creator-gated iterations. Each row shows substantial creative refinement: Core shifts from generic exposition to tense infiltration; Drone moves from action staging to a grounded emotional moment; Chipping evolves from ambiguous …

Original abstract

Generative AI has made content creation increasingly accessible, but many AI-generated videos lack narrative coherence and creative direction, issues that become more substantial at longer durations. Unlike coding, where AI generation benefits from reliable feedback and techniques such as recurrent self-improvement, video generation requires subjective feedback about plot, scenes, and narrative, which naturally motivates approaches that incorporate human creative direction. We introduce CHIEF, a human-AI co-creation video generation framework that places the creator at the center of human-in-the-loop iterative video refinement, and supports them by providing automatic subjective feedback. The creator incorporates their creative direction by driving each iteration, while their revisions are incorporated by a specialized refiner agent. The feedback loop is generated by persona-conditioned multimodal LLMs that watch generated videos and produce subjective critique from the audience perspectives, providing feedback that self-evaluation alone cannot capture. To test the effectiveness of our proposed framework, we work with high school and college students with no prior filmmaking experience to create videos, from short 1-minute videos to a complete short 10-minute film with a complicated plot.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper describes CHIEF, a human-in-the-loop video framework that uses persona-conditioned MLLM agents for subjective feedback, but supplies no metrics, baselines, or outcomes to show that the loop actually helps.

The paper's core idea is a recurrent generation setup where a human creator steers each iteration of a video and an LLM agent supplies critique from simulated audience personas. This targets the real problem that current video models drift on plot and tone over longer clips.

What stands out is the explicit separation of creator direction from automated subjective feedback. The persona conditioning on the critic agents is a concrete design choice that tries to move beyond generic self-evaluation. The user study with inexperienced students making 1- to 10-minute films is a reasonable test population for accessibility claims.

The main weakness is the complete absence of any reported results. The abstract mentions the study but gives no coherence scores, preference data, iteration counts, or comparisons against direct human feedback or non-persona baselines. Without those numbers the claim that the agentic loop improves narrative coherence stays untested.

The methods section would need to show how the refiner agent actually incorporates the critiques and whether the personas produce distinguishable feedback. Citation coverage of prior human-in-the-loop video work also looks thin from the abstract alone.
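
Whether personas produce distinguishable feedback could be probed with a simple lexical measure. The sketch below computes the mean pairwise Jaccard distance between critique texts. The metric, the tokenization, and all names are assumptions chosen for illustration and do not come from the paper. Lexical overlap is only a rough proxy for semantic difference, so a value near 0 would merely suggest that the personas are echoing each other.

```python
from itertools import combinations
from typing import Dict, Set


def tokens(text: str) -> Set[str]:
    """Lowercase word set with surrounding punctuation stripped."""
    return {w.strip(".,;:!?\"'()").lower() for w in text.split()} - {""}


def jaccard_distance(a: str, b: str) -> float:
    """1 - |A ∩ B| / |A ∪ B| over word sets; 0.0 when both texts are empty."""
    ta, tb = tokens(a), tokens(b)
    union = ta | tb
    if not union:
        return 0.0
    return 1.0 - len(ta & tb) / len(union)


def mean_pairwise_distance(critiques: Dict[str, str]) -> float:
    """Average Jaccard distance over all persona pairs (persona name -> critique)."""
    pairs = list(combinations(critiques.values(), 2))
    if not pairs:
        raise ValueError("need critiques from at least two personas")
    return sum(jaccard_distance(a, b) for a, b in pairs) / len(pairs)
```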

This is for readers already building generative media pipelines who want a system sketch. It does not yet contain enough evidence to justify a full referee process.

Referee Report

2 major / 0 minor

Summary. The paper introduces CHIEF, a human-AI co-creation framework for recurrent video generation that centers the creator in iterative refinement loops. Creator intent drives each iteration while a refiner agent incorporates revisions; persona-conditioned multimodal LLMs generate automatic subjective critiques from audience perspectives to address narrative coherence and creative direction issues that self-evaluation cannot capture. Effectiveness is tested via a user study in which high-school and college students with no filmmaking experience produce videos ranging from 1-minute clips to a complete 10-minute short film with a complicated plot.

Significance. If the framework's agentic feedback loop can be shown to produce measurable gains in narrative coherence and creative direction beyond direct human input or model self-evaluation, the work would offer a concrete advance in human-in-the-loop generative video systems. The emphasis on persona-conditioned MLLM critiques as a scalable substitute for audience feedback is a distinctive technical contribution that could influence future co-creation pipelines.

major comments (2)

[Abstract] The description of the user study with inexperienced students states that videos from 1 to 10 minutes were produced, yet reports no quantitative metrics (coherence scores, A/B preference rates, iteration-gain statistics), no baseline comparisons (e.g., against non-LLM feedback or self-refinement), and no outcome data. This absence leaves the central claim that the persona-conditioned MLLM loop improves narrative coherence unsupported.
[Abstract] The value proposition rests on the assertion that persona-conditioned MLLM critiques supply feedback 'that self-evaluation alone cannot capture' and thereby enhance creative direction. No ablation, inter-rater reliability measure, or controlled comparison is described that would allow attribution of any observed improvements to the agentic loop rather than to the human creator's direct revisions.
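
As an illustration of the kind of inter-rater reliability measure mentioned above, the sketch below computes Cohen's kappa for two raters who assign categorical labels to the same items. The function name, the label format, and the convention of returning 1.0 when chance agreement is already total are assumptions made for this example. The paper does not report such a measure.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(rater1: Sequence[Hashable], rater2: Sequence[Hashable]) -> float:
    """Cohen's kappa: (p_o - p_e) / (1 - p_e) for two raters on the same items."""
    if len(rater1) != len(rater2) or not rater1:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(rater1)
    # Observed agreement: fraction of items given the same label by both raters.
    p_o = sum(a == b for a, b in zip(rater1, rater2)) / n
    # Expected chance agreement from each rater's marginal label frequencies.
    c1, c2 = Counter(rater1), Counter(rater2)
    p_e = sum(c1[k] * c2[k] for k in c1.keys() | c2.keys()) / (n * n)
    if p_e == 1.0:
        return 1.0  # assumed convention: both raters used one identical label
    return (p_o - p_e) / (1.0 - p_e)
```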

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on the abstract and the evidence supporting our claims. We agree that the abstract would benefit from clearer indication of study outcomes and will revise it accordingly while preserving the manuscript's focus on the CHIEF framework. Point-by-point responses follow.

Referee: [Abstract] The description of the user study with inexperienced students states that videos from 1 to 10 minutes were produced, yet reports no quantitative metrics (coherence scores, A/B preference rates, iteration-gain statistics), no baseline comparisons (e.g., against non-LLM feedback or self-refinement), and no outcome data. This absence leaves the central claim that the persona-conditioned MLLM loop improves narrative coherence unsupported.

Authors: We agree that the abstract does not report quantitative metrics, baseline comparisons, or outcome statistics. The full manuscript presents the user study as a feasibility demonstration with qualitative descriptions of the videos produced (including the 10-minute film), but does not include numerical coherence scores or controlled baselines. We will revise the abstract to summarize key study outcomes and add quantitative metrics plus baseline comparisons (e.g., against self-refinement) to the methods/results sections in the revised manuscript. revision: yes

Referee: [Abstract] The value proposition rests on the assertion that persona-conditioned MLLM critiques supply feedback 'that self-evaluation alone cannot capture' and thereby enhance creative direction. No ablation, inter-rater reliability measure, or controlled comparison is described that would allow attribution of any observed improvements to the agentic loop rather than to the human creator's direct revisions.

Authors: The framework's design uses persona-conditioned MLLMs specifically to generate audience-perspective critiques unavailable to model self-evaluation. We acknowledge the absence of explicit ablations or inter-rater measures in the current version. We will revise the manuscript to include a clearer discussion of this distinction and add any available inter-rater analysis or limited comparisons from the existing user study data to support attribution of improvements to the agentic loop. revision: yes

Circularity Check

0 steps flagged

No circularity; framework description without equations or self-referential predictions

The paper introduces the CHIEF framework for human-AI video co-creation with persona-conditioned MLLM feedback loops. The abstract and description contain no equations, fitted parameters, predictions of derived quantities, or mathematical derivations. Claims rest on framework design and an informal user study with students; no load-bearing self-citations, ansatzes, or reductions of outputs to inputs by construction are present. The work is self-contained as a conceptual and empirical proposal.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review supplies no information on free parameters, axioms, or invented entities used in the framework.

pith-pipeline@v0.9.1-grok · 5736 in / 1145 out tokens · 33585 ms · 2026-06-26T21:59:44.917620+00:00 · methodology

Reference graph

Works this paper leans on

18 extracted references · 4 linked inside Pith
