MAVEN A Multi-Agent Framework for Multicultural Text-to-Video Generation

Oana Ignat; Parth Bhalerao; Shuowei Li; Yuming Zhao

arxiv: 2605.16716 · v1 · pith:VQBYAW5Anew · submitted 2026-05-16 · 💻 cs.CV · cs.AI

MAVEN A Multi-Agent Framework for Multicultural Text-to-Video Generation

Shuowei Li , Yuming Zhao , Parth Bhalerao , Oana Ignat This is my paper

Pith reviewed 2026-05-19 21:55 UTC · model grok-4.3

classification 💻 cs.CV cs.AI

keywords text-to-video generationmulti-agent refinementcultural fidelityprompt decompositionmulticultural videovideo quality evaluationCLIP-based metrics

0 comments

The pith

Multi-agent refinement of text prompts raises cultural fidelity in generated videos.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents a system that splits text prompts into separate parts covering the people, actions, and settings in a scene, then assigns each part to its own agent for targeted refinement. These agents operate either at the same time or one after the other to insert culturally specific details for cultures such as Chinese, American, and Romanian. The authors test the approach on both single-culture and mixed-culture prompts and report higher scores for cultural match on automated metrics and visual language model judgments. The refinements leave visual quality and motion smoothness largely unchanged. A new set of 243 prompts and 972 videos serves as a shared test collection for measuring cultural performance in text-to-video models.

Core claim

MAVEN decomposes input prompts into person, action, and location dimensions, each handled by a dedicated agent that refines the description for greater cultural accuracy. These refinements can occur in parallel across agents or in sequence. When applied to text-to-video models, the resulting videos demonstrate improved cultural relevance according to automated metrics and visual language model judgments, without compromising measures of visual quality or temporal consistency, in both mono-cultural and cross-cultural prompt settings.

What carries the argument

Parallel specialization of agents, where separate agents focus on refining distinct dimensions of the prompt to enhance cultural specificity.

If this is right

Parallel processing by specialized agents outperforms sequential refinement in cultural relevance scores.
The method applies equally to prompts involving one culture or multiple cultures mixed together.
A dedicated benchmark dataset enables consistent measurement of cultural performance across different generation approaches.
Visual quality and motion consistency remain comparable to unrefined generations.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Applying similar agent-based decomposition could help address cultural biases in other AI generation tasks beyond video.
Integrating this refinement step into the core training of video models might eliminate the need for separate post-processing.
Expanding the benchmark to additional cultures would test whether the improvements hold more broadly or reveal limitations in certain contexts.

Load-bearing premise

That assigning prompt aspects to specialized agents will consistently enhance cultural representation without introducing undetected inconsistencies or new forms of bias in the outputs.

What would settle it

A direct comparison where videos from the multi-agent method receive lower cultural relevance ratings from human viewers or automated judges than those from a basic single-prompt approach would disprove the central improvement claim.

Figures

Figures reproduced from arXiv: 2605.16716 by Oana Ignat, Parth Bhalerao, Shuowei Li, Yuming Zhao.

**Figure 2.** Figure 2: CRS and dimension-specific scores (OCRS, [PITH_FULL_IMAGE:figures/full_fig_p005_2.png] view at source ↗

**Figure 3.** Figure 3: Alignment scores for all four pipelines with [PITH_FULL_IMAGE:figures/full_fig_p006_3.png] view at source ↗

**Figure 4.** Figure 4: VLM-judged cultural relevance scores (scored [PITH_FULL_IMAGE:figures/full_fig_p007_4.png] view at source ↗

**Figure 5.** Figure 5: Visual Quality vs. Temporal Consistency for [PITH_FULL_IMAGE:figures/full_fig_p007_5.png] view at source ↗

**Figure 6.** Figure 6: Qualitative comparison for a mono-cultural example (“a Chinese person playing guzheng at the Potala [PITH_FULL_IMAGE:figures/full_fig_p013_6.png] view at source ↗

read the original abstract

Text-to-video (T2V) generation has rapidly progressed in visual fidelity, yet its ability to faithfully represent multiple cultures within a single prompt remains underexplored. We introduce MAVEN, a multi-agent prompt refinement framework designed to improve cultural fidelity in both mono-cultural and cross-cultural T2V generation. MAVEN decomposes prompts into person, action, and location dimensions, handled by specialized agents operating in parallel or sequentially. To support systematic evaluation, we contribute a new benchmark of 243 culturally grounded prompts and 972 corresponding videos, spanning three cultures (Chinese, American, Romanian), three action categories, and both mono-cultural and cross-cultural scenarios. Evaluations combining CLIP-based metrics, VLM-as-judge assessments, and videoquality measures show that multi-agent refinement, particularly parallel specialization, significantly improves cultural relevance while preserving visual quality and temporal consistency. The dataset and code are available athttps://github.com/AIM-SCU/CRAFT

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces MAVEN, a multi-agent prompt refinement framework for improving cultural fidelity in text-to-video (T2V) generation. Prompts are decomposed into person, action, and location dimensions, each assigned to specialized agents that operate either in parallel or sequentially. A new benchmark of 243 culturally grounded prompts and 972 videos is contributed, covering Chinese, American, and Romanian cultures across mono-cultural and cross-cultural scenarios. Evaluations using CLIP-based metrics, VLM-as-judge ratings, and standard video quality measures report that parallel multi-agent refinement yields significant gains in cultural relevance while preserving visual quality and temporal consistency. The dataset and code are released publicly.

Significance. If the reported improvements prove robust, MAVEN would provide a practical, modular approach to multicultural T2V generation together with a reusable benchmark, addressing a timely gap in generative models. The open release of data and code supports reproducibility and follow-on work. However, the strength of the central claim depends on whether the chosen automatic metrics reliably capture genuine cultural fidelity gains rather than artifacts.

major comments (2)

[Evaluation] Evaluation section: The headline result that parallel specialization improves cultural relevance without harming quality is supported only by CLIP scores, VLM-as-judge ratings, and generic video-quality metrics. These metrics are known to be insensitive to subtle cross-cultural mismatches (incorrect symbolic objects, gesture norms, location-specific details) and can embed training-data biases against less-represented cultures such as Romanian. Without a human-grounded validation set or more targeted cultural metrics, the central claim remains under-supported.
[Benchmark] Benchmark section: The 243-prompt / 972-video benchmark is described at a high level, but the manuscript does not report how cultural grounding was verified or whether inter-annotator agreement was measured. If the prompts themselves contain ambiguities or if agent refinements introduce new inconsistencies that the automatic judges systematically overlook, the reported gains could be artifacts of the evaluation protocol rather than genuine fidelity improvements.

minor comments (2)

[Abstract] The abstract contains the concatenated phrase 'videoquality measures'; insert a space for readability.
[Method] Clarify in the method description whether the parallel agents share any intermediate state or operate completely independently; the current wording leaves this ambiguous.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments on our manuscript. We address each major point below and indicate the revisions incorporated into the next version of the paper.

read point-by-point responses

Referee: [Evaluation] Evaluation section: The headline result that parallel specialization improves cultural relevance without harming quality is supported only by CLIP scores, VLM-as-judge ratings, and generic video-quality metrics. These metrics are known to be insensitive to subtle cross-cultural mismatches (incorrect symbolic objects, gesture norms, location-specific details) and can embed training-data biases against less-represented cultures such as Romanian. Without a human-grounded validation set or more targeted cultural metrics, the central claim remains under-supported.

Authors: We agree that automatic metrics have well-documented limitations for subtle cultural details and may reflect training-data biases. In the revised manuscript we have added a human evaluation study with native annotators from each culture (Chinese, American, Romanian) who rated cultural fidelity on a subset of 150 videos. The study shows statistically significant preference for the parallel multi-agent outputs, with results now reported alongside the automatic metrics in the Evaluation section. We have also expanded the discussion of metric limitations and potential biases. revision: yes
Referee: [Benchmark] Benchmark section: The 243-prompt / 972-video benchmark is described at a high level, but the manuscript does not report how cultural grounding was verified or whether inter-annotator agreement was measured. If the prompts themselves contain ambiguities or if agent refinements introduce new inconsistencies that the automatic judges systematically overlook, the reported gains could be artifacts of the evaluation protocol rather than genuine fidelity improvements.

Authors: We have revised the Benchmark section to describe the verification process: each prompt was independently reviewed by two native speakers or cultural experts per culture, with disagreements resolved by consensus. We now report inter-annotator agreement using Fleiss' kappa (0.83), indicating substantial agreement. These details address concerns about prompt quality and evaluation artifacts. revision: yes

Circularity Check

0 steps flagged

No significant circularity; empirical claims rest on new benchmark and external metrics

full rationale

The paper introduces MAVEN as a multi-agent prompt decomposition framework (person/action/location agents) and supports its claims with a newly contributed benchmark of 243 prompts plus evaluations on CLIP-based metrics, VLM-as-judge ratings, and standard video-quality measures. No equations, fitted parameters, or self-citations appear in the provided text that would reduce the reported improvements to inputs by construction. The central result—that parallel specialization improves cultural relevance—is presented as an empirical outcome on an external benchmark rather than a definitional or self-referential prediction, rendering the derivation self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the untested premise that prompt decomposition into three fixed dimensions plus parallel agent refinement produces measurable cultural gains; no free parameters or invented physical entities are mentioned.

axioms (1)

domain assumption Specialized agents operating on person/action/location dimensions will improve cultural fidelity without degrading temporal consistency or visual quality.
This premise is invoked when the abstract states that parallel specialization significantly improves cultural relevance while preserving quality.

pith-pipeline@v0.9.0 · 5702 in / 1214 out tokens · 45390 ms · 2026-05-19T21:55:16.376385+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

21 extracted references · 21 canonical work pages · 2 internal anchors

[1]

SimCityNet: Quanti- fying Geo-Cultural bias in AI-generated urban videos through interpretable scene embeddings. SSRN. Ac- cessed 2025-10-10. Yubin Chen, Xuyang Guo, Zhenmei Shi, Zhao Song, and Jiahao Zhang

work page 2025
[2]

Google DeepMind

T2vworldbench: A bench- mark for evaluating world knowledge in text-to-video generation.CoRR, abs/2507.18107. Google DeepMind

work page arXiv
[3]

The Llama 3 Herd of Models

The llama 3 herd of mod- els.CoRR, abs/2407.21783. Panwen Hu, Jin Jiang, Jianqi Chen, Mingfei Han, Shengcai Liao, Xiaojun Chang, and Xiaodan Liang

work page internal anchor Pith review Pith/arXiv arXiv
[4]

Storyagent: Customized storytelling video generation via multi-agent collaboration.CoRR, abs/2411.04925. Ziqi Huang, Fan Zhang, Xiaojie Xu, Yinan He, Jiashuo Yu, Ziyue Dong, Qianli Ma, Nattapol Chanpaisit, Chenyang Si, Yuming Jiang, Yaohui Wang, Xinyuan Chen, Ying-Cong Chen, Limin Wang, Dahua Lin, Yu Qiao, and Ziwei Liu

work page arXiv
[5]

Vbench++: Comprehensive and versatile bench- mark suite for video generative models.arXiv preprint arXiv:2411.13503, 2024

Vbench++: Com- prehensive and versatile benchmark suite for video generative models.CoRR, abs/2411.13503. Nithish Kannen, Arif Ahmad, Marco Andreetto, Vinod- kumar Prabhakaran, Utsav Prabhu, Adji Bousso Di- eng, Pushpak Bhattacharyya, and Shachi Dave

work page arXiv
[6]

InAdvances in Neural Information Processing Systems 38: Annual Conference on Neu- ral Information Processing Systems 2024, NeurIPS 2024, Vancouver, BC, Canada, December 10 - 15,

Beyond aesthetics: Cultural competence in text-to- image models. InAdvances in Neural Information Processing Systems 38: Annual Conference on Neu- ral Information Processing Systems 2024, NeurIPS 2024, Vancouver, BC, Canada, December 10 - 15,

work page 2024
[7]

Towards automatic evaluation for image transcreation. InProceedings of the 2025 Con- ference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL 2025 - Volume 1: Long Papers, Albuquerque, New Mexico, USA, April 29 - May 4, 2025, pages 7034–7047. Association for Computational Linguisti...

work page 2025
[8]

Shudong Liu, Yiqiao Jin, Cheng Li, Derek F

CULTURE-GEN: revealing global cultural percep- tion in language models through natural language prompting.CoRR, abs/2404.10199. Shudong Liu, Yiqiao Jin, Cheng Li, Derek F. Wong, Qingsong Wen, Lichao Sun, Haipeng Chen, Xing Xie, and Jindong Wang

work page arXiv
[9]

Culturevlm: Characterizing and improving cultural understanding of vision-language models for over 100 countries,

Culturevlm: Char- acterizing and improving cultural understanding of vision-language models for over 100 countries. CoRR, abs/2501.01282. Yaofang Liu, Xiaodong Cun, Xuebo Liu, Xintao Wang, Yong Zhang, Haoxin Chen, Yang Liu, Tieyong Zeng, Raymond Chan, and Ying Shan

work page arXiv
[10]

InIEEE/CVF Conference on Computer Vi- sion and Pattern Recognition, CVPR 2024, Seattle, WA, USA, June 16-22, 2024, pages 22139–22149

Evalcrafter: Benchmarking and evaluating large video generation models. InIEEE/CVF Conference on Computer Vi- sion and Pattern Recognition, CVPR 2024, Seattle, WA, USA, June 16-22, 2024, pages 22139–22149. IEEE. OpenAI

work page 2024
[11]

Accessed 2025-10-

Sora system card. Accessed 2025-10-

work page 2025
[12]

Learn- ing transferable visual models from natural language supervision. InProceedings of the 38th International Conference on Machine Learning, ICML 2021, 18-24 July 2021, Virtual Event, volume 139 ofProceedings of Machine Learning Research, pages 8748–8763. PMLR. Kaiyue Sun, Kaiyi Huang, Xian Liu, Yue Wu, Zihan Xu, Zhenguo Li, and Xihui Liu

work page 2021
[13]

InIEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2025, Nashville, TN, USA, June 11-15, 2025, pages 8406–8416

T2v-compbench: A comprehensive benchmark for compositional text- to-video generation. InIEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2025, Nashville, TN, USA, June 11-15, 2025, pages 8406–8416. Computer Vision Foundation / IEEE. Rui Sun, Yumin Zhang, Tejal Shah, Jiahao Sun, Shuoy- ing Zhang, Wenqi Li, Haoran Duan, Bo Wei, and Rajiv Ranjan

work page 2025
[14]

Qian Wang, Ziqi Huang, Ruoxi Jia, Paul Debevec, and Ning Yu

From sora what we can see: A survey of text-to-video generation.CoRR, abs/2405.10674. Qian Wang, Ziqi Huang, Ruoxi Jia, Paul Debevec, and Ning Yu

work page arXiv
[15]

Yi Wang, Yinan He, Yizhuo Li, Kunchang Li, Jiashuo Yu, Xin Ma, Xinhao Li, Guo Chen, Xinyuan Chen, Yaohui Wang, Ping Luo, Ziwei Liu, Yali Wang, Limin Wang, and Yu Qiao

Mavis: A multi-agent frame- work for long-sequence video storytelling.CoRR, abs/2508.08487. Yi Wang, Yinan He, Yizhuo Li, Kunchang Li, Jiashuo Yu, Xin Ma, Xinhao Li, Guo Chen, Xinyuan Chen, Yaohui Wang, Ping Luo, Ziwei Liu, Yali Wang, Limin Wang, and Yu Qiao

work page arXiv
[16]

InThe Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11,

Internvid: A large-scale video-text dataset for multimodal understanding and generation. InThe Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11,

work page 2024
[17]

AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation

Autogen: Enabling next-gen llm applications via multi-agent conversation.CoRR, abs/2308.08155. Weijia Wu, Zeyu Zhu, and Mike Zheng Shou

work page internal anchor Pith review Pith/arXiv arXiv
[18]

Automated movie generation via multi-agent cot plan- ning.ArXiv, abs/2503.07314,

Au- tomated movie generation via multi-agent cot plan- ning.CoRR, abs/2503.07314. Zhifei Xie, Daniel Tang, Dingwei Tan, Jacques Klein, Tegawend F. Bissyand, and Saad Ezzini

work page arXiv
[19]

Dreamfactory: Pioneering multi-scene long video generation with a multi-agent framework.CoRR, abs/2408.11788. Zhuoyi Yang, Jiayan Teng, Wendi Zheng, Ming Ding, Shiyu Huang, Jiazheng Xu, Yuanming Yang, Wenyi Hong, Xiaohan Zhang, Guanyu Feng, Da Yin, Yux- uan Zhang, Weihan Wang, Yean Cheng, Bin Xu, Xiaotao Gu, Yuxiao Dong, and Jie Tang

work page arXiv
[20]

InThe Thirteenth International Conference on Learning Representations, ICLR 2025, Singapore, April 24-28,

Cogvideox: Text-to-video diffusion models with an expert transformer. InThe Thirteenth International Conference on Learning Representations, ICLR 2025, Singapore, April 24-28,

work page 2025
[21]

Mora: Enabling generalist video generation via a multi-agent framework.arXiv preprint arXiv:2403.13248, 2024

Mora: Enabling generalist video generation via A multi-agent framework.CoRR, abs/2403.13248. A Compute and Runtime Experiments are conducted on NVIDIA H100 GPUs. Generating a single 5-second video takes approximately 3 minutes. SA and MAP pipelines require one additional minute for prompt refine- ment, while MAS requires approximately 2 addi- tional minut...

work page arXiv

[1] [1]

SimCityNet: Quanti- fying Geo-Cultural bias in AI-generated urban videos through interpretable scene embeddings. SSRN. Ac- cessed 2025-10-10. Yubin Chen, Xuyang Guo, Zhenmei Shi, Zhao Song, and Jiahao Zhang

work page 2025

[2] [2]

Google DeepMind

T2vworldbench: A bench- mark for evaluating world knowledge in text-to-video generation.CoRR, abs/2507.18107. Google DeepMind

work page arXiv

[3] [3]

The Llama 3 Herd of Models

The llama 3 herd of mod- els.CoRR, abs/2407.21783. Panwen Hu, Jin Jiang, Jianqi Chen, Mingfei Han, Shengcai Liao, Xiaojun Chang, and Xiaodan Liang

work page internal anchor Pith review Pith/arXiv arXiv

[4] [4]

Storyagent: Customized storytelling video generation via multi-agent collaboration.CoRR, abs/2411.04925. Ziqi Huang, Fan Zhang, Xiaojie Xu, Yinan He, Jiashuo Yu, Ziyue Dong, Qianli Ma, Nattapol Chanpaisit, Chenyang Si, Yuming Jiang, Yaohui Wang, Xinyuan Chen, Ying-Cong Chen, Limin Wang, Dahua Lin, Yu Qiao, and Ziwei Liu

work page arXiv

[5] [5]

Vbench++: Comprehensive and versatile bench- mark suite for video generative models.arXiv preprint arXiv:2411.13503, 2024

Vbench++: Com- prehensive and versatile benchmark suite for video generative models.CoRR, abs/2411.13503. Nithish Kannen, Arif Ahmad, Marco Andreetto, Vinod- kumar Prabhakaran, Utsav Prabhu, Adji Bousso Di- eng, Pushpak Bhattacharyya, and Shachi Dave

work page arXiv

[6] [6]

InAdvances in Neural Information Processing Systems 38: Annual Conference on Neu- ral Information Processing Systems 2024, NeurIPS 2024, Vancouver, BC, Canada, December 10 - 15,

Beyond aesthetics: Cultural competence in text-to- image models. InAdvances in Neural Information Processing Systems 38: Annual Conference on Neu- ral Information Processing Systems 2024, NeurIPS 2024, Vancouver, BC, Canada, December 10 - 15,

work page 2024

[7] [7]

Towards automatic evaluation for image transcreation. InProceedings of the 2025 Con- ference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL 2025 - Volume 1: Long Papers, Albuquerque, New Mexico, USA, April 29 - May 4, 2025, pages 7034–7047. Association for Computational Linguisti...

work page 2025

[8] [8]

Shudong Liu, Yiqiao Jin, Cheng Li, Derek F

CULTURE-GEN: revealing global cultural percep- tion in language models through natural language prompting.CoRR, abs/2404.10199. Shudong Liu, Yiqiao Jin, Cheng Li, Derek F. Wong, Qingsong Wen, Lichao Sun, Haipeng Chen, Xing Xie, and Jindong Wang

work page arXiv

[9] [9]

Culturevlm: Characterizing and improving cultural understanding of vision-language models for over 100 countries,

Culturevlm: Char- acterizing and improving cultural understanding of vision-language models for over 100 countries. CoRR, abs/2501.01282. Yaofang Liu, Xiaodong Cun, Xuebo Liu, Xintao Wang, Yong Zhang, Haoxin Chen, Yang Liu, Tieyong Zeng, Raymond Chan, and Ying Shan

work page arXiv

[10] [10]

InIEEE/CVF Conference on Computer Vi- sion and Pattern Recognition, CVPR 2024, Seattle, WA, USA, June 16-22, 2024, pages 22139–22149

Evalcrafter: Benchmarking and evaluating large video generation models. InIEEE/CVF Conference on Computer Vi- sion and Pattern Recognition, CVPR 2024, Seattle, WA, USA, June 16-22, 2024, pages 22139–22149. IEEE. OpenAI

work page 2024

[11] [11]

Accessed 2025-10-

Sora system card. Accessed 2025-10-

work page 2025

[12] [12]

Learn- ing transferable visual models from natural language supervision. InProceedings of the 38th International Conference on Machine Learning, ICML 2021, 18-24 July 2021, Virtual Event, volume 139 ofProceedings of Machine Learning Research, pages 8748–8763. PMLR. Kaiyue Sun, Kaiyi Huang, Xian Liu, Yue Wu, Zihan Xu, Zhenguo Li, and Xihui Liu

work page 2021

[13] [13]

InIEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2025, Nashville, TN, USA, June 11-15, 2025, pages 8406–8416

T2v-compbench: A comprehensive benchmark for compositional text- to-video generation. InIEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2025, Nashville, TN, USA, June 11-15, 2025, pages 8406–8416. Computer Vision Foundation / IEEE. Rui Sun, Yumin Zhang, Tejal Shah, Jiahao Sun, Shuoy- ing Zhang, Wenqi Li, Haoran Duan, Bo Wei, and Rajiv Ranjan

work page 2025

[14] [14]

Qian Wang, Ziqi Huang, Ruoxi Jia, Paul Debevec, and Ning Yu

From sora what we can see: A survey of text-to-video generation.CoRR, abs/2405.10674. Qian Wang, Ziqi Huang, Ruoxi Jia, Paul Debevec, and Ning Yu

work page arXiv

[15] [15]

Yi Wang, Yinan He, Yizhuo Li, Kunchang Li, Jiashuo Yu, Xin Ma, Xinhao Li, Guo Chen, Xinyuan Chen, Yaohui Wang, Ping Luo, Ziwei Liu, Yali Wang, Limin Wang, and Yu Qiao

Mavis: A multi-agent frame- work for long-sequence video storytelling.CoRR, abs/2508.08487. Yi Wang, Yinan He, Yizhuo Li, Kunchang Li, Jiashuo Yu, Xin Ma, Xinhao Li, Guo Chen, Xinyuan Chen, Yaohui Wang, Ping Luo, Ziwei Liu, Yali Wang, Limin Wang, and Yu Qiao

work page arXiv

[16] [16]

InThe Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11,

Internvid: A large-scale video-text dataset for multimodal understanding and generation. InThe Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11,

work page 2024

[17] [17]

AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation

Autogen: Enabling next-gen llm applications via multi-agent conversation.CoRR, abs/2308.08155. Weijia Wu, Zeyu Zhu, and Mike Zheng Shou

work page internal anchor Pith review Pith/arXiv arXiv

[18] [18]

Automated movie generation via multi-agent cot plan- ning.ArXiv, abs/2503.07314,

Au- tomated movie generation via multi-agent cot plan- ning.CoRR, abs/2503.07314. Zhifei Xie, Daniel Tang, Dingwei Tan, Jacques Klein, Tegawend F. Bissyand, and Saad Ezzini

work page arXiv

[19] [19]

Dreamfactory: Pioneering multi-scene long video generation with a multi-agent framework.CoRR, abs/2408.11788. Zhuoyi Yang, Jiayan Teng, Wendi Zheng, Ming Ding, Shiyu Huang, Jiazheng Xu, Yuanming Yang, Wenyi Hong, Xiaohan Zhang, Guanyu Feng, Da Yin, Yux- uan Zhang, Weihan Wang, Yean Cheng, Bin Xu, Xiaotao Gu, Yuxiao Dong, and Jie Tang

work page arXiv

[20] [20]

InThe Thirteenth International Conference on Learning Representations, ICLR 2025, Singapore, April 24-28,

Cogvideox: Text-to-video diffusion models with an expert transformer. InThe Thirteenth International Conference on Learning Representations, ICLR 2025, Singapore, April 24-28,

work page 2025

[21] [21]

Mora: Enabling generalist video generation via a multi-agent framework.arXiv preprint arXiv:2403.13248, 2024

Mora: Enabling generalist video generation via A multi-agent framework.CoRR, abs/2403.13248. A Compute and Runtime Experiments are conducted on NVIDIA H100 GPUs. Generating a single 5-second video takes approximately 3 minutes. SA and MAP pipelines require one additional minute for prompt refine- ment, while MAS requires approximately 2 addi- tional minut...

work page arXiv