
arXiv: 2605.28035 · v1 · pith:GSENSGWDnew · submitted 2026-05-27 · 💻 cs.AI · cs.MM · cs.SD

MTAVG-Bench 2.0: Diagnosing Failure Modes of Cinematic Expressiveness in Multi-Talker Audio-Video Generation

Pith reviewed 2026-06-29 12:56 UTC · model grok-4.3

classification 💻 cs.AI · cs.MM · cs.SD
keywords multi-talker audio-video generation · cinematic expressiveness · failure diagnosis · benchmark · omni large language models · short-drama generation · audio-visual alignment

The pith

MTAVG-Bench 2.0 supplies over 10,000 QA instances and a four-category taxonomy to diagnose why multi-talker video generators fail at cinematic expressiveness.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper creates MTAVG-Bench 2.0 to move evaluation of multi-talker audio-video generation beyond lip-sync and basic alignment toward scene-level cinematic qualities. It defines a failure taxonomy across acting, narrative, atmosphere, and audio-visual language, then builds question-answering instances that test whether large models can locate and classify those failures in short-drama clips. Experiments show commercial omni models such as Gemini outperform other evaluators yet still miss many complex failures. The work therefore supplies both a diagnostic tool and evidence that current generators and evaluators remain limited on higher-order expressiveness.

Core claim

MTAVG-Bench 2.0 establishes a high-level failure taxonomy spanning acting, narrative, atmosphere, and audio-visual language, then uses it to construct more than 10,000 question-answering instances plus short-drama and temporal-localization subsets that let omni large language models diagnose cinematic expressiveness failures in multi-talker scene generation; commercial models lead the evaluations but continue to struggle with complex cases.

What carries the argument

The four-part failure taxonomy (acting, narrative, atmosphere, audio-visual language) together with the constructed QA instances that turn those categories into diagnostic tests for generation outputs.
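
To make this mechanism concrete, the sketch below shows one way a diagnostic QA instance could be represented and scored per taxonomy category. The four category names come from the paper; the `QAInstance` fields, the multiple-choice format, and the `accuracy_by_category` helper are illustrative assumptions, not the benchmark's actual schema or scoring code.

```python
from collections import defaultdict
from dataclasses import dataclass

# The four top-level failure categories named in the paper's taxonomy.
CATEGORIES = ("acting", "narrative", "atmosphere", "audio-visual language")


@dataclass(frozen=True)
class QAInstance:
    # All field names are hypothetical, chosen for illustration only.
    clip_id: str     # identifier of the generated clip under diagnosis
    category: str    # one of CATEGORIES
    question: str    # diagnostic question posed to the evaluator model
    options: tuple   # candidate answers (multiple choice is an assumption)
    answer: str      # the annotated correct option

    def __post_init__(self):
        if self.category not in CATEGORIES:
            raise ValueError(f"unknown category: {self.category!r}")
        if self.answer not in self.options:
            raise ValueError("answer must be one of the options")


def accuracy_by_category(instances, predictions):
    """Return {category: fraction of correct predictions}.

    `predictions` holds the evaluator's chosen option for each instance,
    in the same order as `instances`.
    """
    if len(instances) != len(predictions):
        raise ValueError("instances and predictions must align one-to-one")
    correct = defaultdict(int)
    total = defaultdict(int)
    for inst, pred in zip(instances, predictions):
        total[inst.category] += 1
        correct[inst.category] += int(pred == inst.answer)
    return {cat: correct[cat] / total[cat] for cat in total}
```

Reporting accuracy per category rather than as a single number keeps the diagnostic intent visible: an evaluator that is strong on acting but weak on narrative would show up as such.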

If this is right

  • Evaluation of MTAVG models must incorporate scene-level criteria for character performance and narrative coherence rather than relying only on low-level alignment metrics.
  • Omni large language models can serve as evaluators for these higher-level failures, though even the strongest ones require further improvement on complex cases.
  • Short-drama and temporal localization subsets enable more granular diagnosis of where and how failures occur within generated clips.
  • Development of the next generation of MTAVG systems should target the specific failure modes identified in the taxonomy.
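
For the temporal-localization point in the list above, a common way to score a predicted failure interval against an annotated one is temporal intersection-over-union. The sketch below assumes intervals are `(start, end)` pairs in seconds; this page does not state which localization metric the paper actually uses, so treat IoU here as an assumed stand-in.

```python
def temporal_iou(pred, gold):
    """Intersection-over-union of two (start, end) intervals in seconds."""
    p0, p1 = pred
    g0, g1 = gold
    if p1 < p0 or g1 < g0:
        raise ValueError("each interval must satisfy start <= end")
    # Overlap is zero when the intervals are disjoint.
    inter = max(0.0, min(p1, g1) - max(p0, g0))
    union = (p1 - p0) + (g1 - g0) - inter
    # Two zero-length intervals have no measurable union; score them as 0.
    return inter / union if union > 0 else 0.0
```

A prediction could then be counted as a correct localization when its IoU with the annotation exceeds a chosen threshold; the threshold itself would be a design choice, not something specified here.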

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The benchmark could be used as a training signal to steer generation models toward better cinematic qualities during fine-tuning.
  • Similar taxonomies might be adapted to single-talker or non-dialogue video generation tasks.
  • If the taxonomy proves stable across new model releases, it could become a standard yardstick for progress in expressive multi-character video.

Load-bearing premise

The chosen failure categories and the 10,000 QA instances fully and fairly represent the cinematic expressiveness problems that actually occur in multi-talker scene generation.

What would settle it

Release a new MTAVG model that scores high on the benchmark yet produces scenes judged incoherent by human viewers on acting or narrative grounds, or conversely a model that improves dramatically on human cinematic judgments while scoring low on the benchmark.
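
The test described above amounts to checking whether benchmark scores and human cinematic judgments rank models the same way. A minimal sketch, assuming each model has one aggregate benchmark score and one mean human rating (both hypothetical inputs; no such protocol is described on this page), is a Spearman rank correlation with average ranks for ties:

```python
def average_ranks(values):
    """Return 1-based ranks for `values`, giving tied values their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j across the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation computed on ranks."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx = sum(rx) / len(rx)
    my = sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one input is constant")
    return cov / (var_x * var_y) ** 0.5
```

A correlation near zero or below across a set of models would be the kind of evidence the paragraph above describes: benchmark rankings diverging from what human viewers perceive.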

Figures

Figures reproduced from arXiv: 2605.28035 by Changsen Yuan, Dian Jin, Haitian Li, Heyan Huang, Jiajun Xu, Jingyun Liao, Jinxing Zhou, Liangji Chen, Tian Lan, Xian-Ling Mao, Xuefeng Chen, Xu Liu, Yanghao Zhou, Yiming Cheng, Yousheng Feng, Yu Bai, Yueying Liu, Ziqin Zhou.

Figure 1: Overview of the MTAVG-Bench 2.0 framework. The benchmark is constructed through a pipeline of film analysis and…
Figure 2: Overview of MTAVG-Bench 2.0 construction framework. The pipeline consists of three stages. First, classical film…
Figure 3: Dataset statistics of MTAVG-Bench 2.0. Top: dis…
Figure 4: Failure rates across fine-grained failure modes for…
Figure 5: Case study of diagnostic QA for a failure case under…
Figure 6: Stacked sentiment composition across positive, neg…
Figure 7: System prompt for scene-level shot analysis and de-identified JSON planning (Part I).
Figure 8: System prompt for scene-level shot analysis and de-identified JSON planning (Part II).
Figure 9: System prompt for segment-level video generation.
Figure 10: Shared judge prompt used for rationale-consistency evaluation.
Figure 11: Failure Rate on each failure mode
Figure 12: Distribution of per-video error counts across the three top-level dimensions and the aggregate total.
Figure 13: Distribution of LLM-as-judge rationale-consistency scores. Even though the surrounding scene shifts dramatically, the performance remains largely blank, indicating missing environmental awareness and weakened interaction grounding
Figure 15: Acting-level failure case: speech mode confusion in dialogue performance.
Figure 16: Acting-level failure case: robotic movement in motion performance.
Figure 17: Acting-level failure case: missing environmental awareness during interaction.
Figure 18: Cinematography-level failure case: camera-action misalignment in intra-shot camera control.
Figure 19: Cinematography-level failure case: weak shot progression and abrupt escalation.
Figure 20: Cinematography-level failure case: 30-degree rule violation in inter-shot grammar.
Figure 21: Atmosphere-level failure case: background music remains soft despite the scene’s escalating chaos and emotional…
Figure 22: Customized Label Studio Interface for MTAVG-Bench 2.0 Annotation
Original abstract

In recent years, Multi-Talker Audio-Video Generation (MTAVG) models have shown promising performance on fundamental metrics such as lip-sync and audio-visual alignment. However, these metrics remain insufficient for assessing cinematic expressiveness in scene-level generation. In multi-character scenes, generation models must go beyond audio-visual realism to convey coherent character performance and other higher-level cinematic qualities. To fill this gap, we introduce MTAVG-Bench 2.0, a benchmark for diagnosing failure modes of cinematic expressiveness in multi-talker audio-video generation. Unlike prior settings that mainly focus on the quality of basic multi-turn dialogue, MTAVG-Bench 2.0 targets short-drama and scene-level generation, and establishes a high-level failure taxonomy spanning acting, narrative, atmosphere, and audio-visual language. Based on this taxonomy, we construct more than 10,000 question-answering evaluation instances, together with subsets for short-drama-level assessment and temporal localization of failure modes, to systematically evaluate the ability of omni large language models to diagnose high-level audio-visual failures. Experimental results show that commercial omni models such as Gemini substantially outperform other evaluators, yet even the strongest models continue to struggle with complex failures in our benchmark. These results demonstrate that MTAVG-Bench 2.0 provides a systematic benchmark for failure diagnosis in cinematic multi-talker audio-video generation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces MTAVG-Bench 2.0, a benchmark for diagnosing failure modes of cinematic expressiveness in multi-talker audio-video generation. It defines a four-category taxonomy (acting, narrative, atmosphere, audio-visual language), constructs over 10,000 QA instances plus subsets for short-drama assessment and temporal localization, and evaluates omni LLMs, finding that commercial models such as Gemini substantially outperform others yet all struggle with complex failures.

Significance. If the taxonomy and instances are shown to be exhaustive and reliably annotated, the benchmark would fill a clear gap by moving beyond low-level metrics (lip-sync, alignment) to scene-level cinematic qualities, offering a diagnostic tool for MTAVG model development.

major comments (2)
  1. [Abstract] Abstract and construction description: the central claim that MTAVG-Bench 2.0 supplies a 'systematic benchmark' for failure diagnosis rests on the taxonomy being exhaustive and the >10k QA instances being reliable, yet no details are provided on taxonomy derivation, annotation protocol, inter-annotator agreement, or external validation. This omission directly undermines trustworthiness of the reported failure diagnoses.
  2. [Experiments] Experimental results section: the claim that Gemini 'substantially outperform[s] other evaluators' and that 'even the strongest models continue to struggle' cannot be interpreted without evidence that the evaluation instances faithfully instantiate the taxonomy without annotator bias or omitted modes; the reported performance gaps may reflect benchmark artifacts rather than model capabilities.
minor comments (1)
  1. [Abstract] The abstract refers to 'subsets for short-drama-level assessment and temporal localization' without clarifying how these subsets are constructed or sampled from the main 10k instances.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for highlighting the importance of benchmark reliability and construction details. We will revise the manuscript to incorporate the requested information on taxonomy derivation, annotation protocols, and validation, which will strengthen the claims regarding the systematic nature of MTAVG-Bench 2.0.

Point-by-point responses
  1. Referee: [Abstract] Abstract and construction description: the central claim that MTAVG-Bench 2.0 supplies a 'systematic benchmark' for failure diagnosis rests on the taxonomy being exhaustive and the >10k QA instances being reliable, yet no details are provided on taxonomy derivation, annotation protocol, inter-annotator agreement, or external validation. This omission directly undermines trustworthiness of the reported failure diagnoses.

    Authors: We agree that these methodological details are critical for establishing trustworthiness. In the revised manuscript, we will add a new subsection under 'Benchmark Construction' detailing: (1) taxonomy derivation via systematic review of cinematic expressiveness literature (e.g., acting theory, narrative structure) combined with iterative expert input from film studies collaborators; (2) the full annotation protocol, including guidelines, training procedures, and quality control steps; (3) inter-annotator agreement statistics (e.g., Fleiss' kappa) computed across multiple annotators on sampled instances; and (4) external validation efforts, such as alignment checks against existing scene-level video analysis resources. These additions will directly support the claim of a systematic benchmark. revision: yes

  2. Referee: [Experiments] Experimental results section: the claim that Gemini 'substantially outperform[s] other evaluators' and that 'even the strongest models continue to struggle' cannot be interpreted without evidence that the evaluation instances faithfully instantiate the taxonomy without annotator bias or omitted modes; the reported performance gaps may reflect benchmark artifacts rather than model capabilities.

    Authors: We concur that the experimental claims require supporting evidence of instance fidelity. The planned revisions to the construction section will provide this by documenting the taxonomy's coverage, annotation reliability metrics, and steps taken to minimize bias (e.g., diverse annotator backgrounds and adjudication processes). We will also add a limitations paragraph acknowledging that while the taxonomy targets major cinematic failure modes, complete exhaustiveness cannot be formally proven; however, the >10k instances and subsets for short-drama and temporal localization offer broad coverage. This will allow readers to evaluate whether performance differences reflect model capabilities rather than artifacts. revision: yes
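
The first response above names Fleiss' kappa as an example agreement statistic. As a rough illustration of what such a number measures, the sketch below computes it from a count matrix in which `counts[i][j]` is how many annotators assigned item `i` to category `j`. The input layout and function name are assumptions made for this example; nothing here reflects the authors' actual annotation data or tooling.

```python
def fleiss_kappa(counts):
    """Fleiss' kappa for a list of per-item category counts.

    Every row must sum to the same number of raters n (n >= 2).
    """
    if not counts:
        raise ValueError("need at least one rated item")
    n_items = len(counts)
    n_raters = sum(counts[0])
    if n_raters < 2:
        raise ValueError("need at least two raters per item")
    if any(sum(row) != n_raters for row in counts):
        raise ValueError("every item must be rated by the same number of raters")
    n_cats = len(counts[0])

    # Overall proportion of assignments falling in each category.
    p_cat = [
        sum(row[j] for row in counts) / (n_items * n_raters)
        for j in range(n_cats)
    ]
    # Per-item agreement: fraction of rater pairs that agree on the item.
    p_item = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_observed = sum(p_item) / n_items
    p_expected = sum(p * p for p in p_cat)
    if p_expected == 1.0:
        raise ValueError("kappa is undefined when all ratings use one category")
    return (p_observed - p_expected) / (1.0 - p_expected)
```

Values near 1 indicate agreement well above chance, while values near or below 0 indicate agreement no better than chance, which is why a reported kappa would bear directly on the referee's reliability concern.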

Circularity Check

0 steps flagged

No significant circularity; benchmark construction is self-contained

full rationale

The paper introduces MTAVG-Bench 2.0 by defining a four-category failure taxonomy (acting, narrative, atmosphere, audio-visual language) and constructing >10k QA instances plus subsets for short-drama and temporal localization. This is the standard, non-circular workflow for benchmark papers: the taxonomy and instances are new artifacts created by the authors, and model evaluations (e.g., Gemini outperforming others) are performed against them without any equation, fitted parameter, or self-citation reducing a claimed result to its own inputs by construction. No self-definitional loops, fitted-input predictions, or load-bearing self-citations appear in the provided abstract or description. The central claim remains independent of the benchmark's internal construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the validity of the four-category failure taxonomy and the assumption that the 10,000 QA instances faithfully represent real cinematic failures. No free parameters are introduced. No new physical entities are postulated.

axioms (1)
  • Domain assumption: the proposed taxonomy of acting, narrative, atmosphere, and audio-visual language covers the relevant high-level cinematic qualities without significant omissions or overlaps.
    Invoked when constructing the evaluation instances and when interpreting model performance on complex failures.


