arxiv: 2605.20183 · v1 · pith:ZYNQLIP7new · submitted 2026-05-19 · 💻 cs.CV

MSAVBench: Towards Comprehensive and Reliable Evaluation of Multi-Shot Audio-Video Generation

Pith reviewed 2026-05-20 05:12 UTC · model grok-4.3

classification 💻 cs.CV
keywords multi-shot video generation · audio-video synthesis · benchmark evaluation · generative models · human judgment alignment · video quality assessment · shot segmentation

The pith

MSAVBench is introduced as a benchmark and framework for evaluating multi-shot audio-video generation models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper seeks to establish a reliable way to evaluate models that generate multi-shot audio-video content, which matters increasingly for real-world applications such as storytelling videos. Existing benchmarks are limited in scope and rely on rigid evaluation pipelines that align poorly with human judgments. By building MSAVBench around four dimensions (video, audio, shot, and reference) and pairing it with evaluation mechanisms such as adaptive self-correction of shot segmentation, the work aims to enable better assessment, and in turn improvement, of these generation systems. A sympathetic reader would care because more reliable evaluation makes it easier to build models that produce coherent narratives with sound and visuals.

Core claim

The authors present MSAVBench as the first comprehensive benchmark for multi-shot audio-video generation that includes diverse task settings, varying shot counts up to 15, and non-realistic scenarios across video, audio, shot, and reference dimensions. The associated adaptive hybrid evaluation framework uses self-correction for segmentation, instance-wise rubrics for subjective metrics, and tool-grounded evidence extraction to improve robustness and achieve strong alignment with human judgments. Systematic testing of 19 models reveals ongoing challenges in director-level control and fine-grained audio-visual synchronization, suggesting that modular or agentic pipelines may help narrow the performance gap between open- and closed-source models.

What carries the argument

MSAVBench, the benchmark dataset and adaptive hybrid evaluation framework that combines automatic tools with human-aligned metrics across multiple quality dimensions.

If this is right

  • Models can be systematically compared on their ability to handle complex multi-shot narratives.
  • Current generation systems need improvement in maintaining consistency across shots and synchronizing audio with visuals.
  • Agentic or modular approaches appear effective for enhancing open-source model performance.
  • Future research will benefit from released data and code for developing better evaluation methods.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • This evaluation approach might be adapted to assess other types of generative media like text-to-video with longer sequences.
  • Researchers could use the benchmark to test new models specifically for narrative coherence in audio-visual outputs.
  • It may highlight the need for integrated training methods that jointly optimize video and audio components rather than separate modules.

Load-bearing premise

The four chosen dimensions along with the adaptive mechanisms and rubrics fully capture the quality of multi-shot audio-video generation without missing important failure cases or creating evaluation biases.

What would settle it

A large-scale human preference study on generated multi-shot videos where the benchmark rankings do not match the human rankings would indicate the evaluation is not reliable.
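The rank comparison described here can be sketched in a few lines. The following is a minimal illustration of Spearman's rank correlation (the statistic the abstract reports), computed as the Pearson correlation of the ranks with ties given averaged ranks. The function names and the five score pairs are hypothetical and are not taken from the paper or its code.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman_rho(xs, ys):
    """Spearman's rho: Pearson correlation computed on the ranks."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("rho is undefined when one sequence is constant")
    return cov / (var_x * var_y) ** 0.5


# Hypothetical per-model scores (illustrative only, not results from the paper).
benchmark_scores = [0.82, 0.75, 0.64, 0.58, 0.41]
human_scores = [0.79, 0.81, 0.60, 0.55, 0.43]
rho = spearman_rho(benchmark_scores, human_scores)
```

In this toy case the two rankings differ only by a swap of the top two models. A study that "settles it" in the negative would be one where rho computed on held-out human rankings falls well below the reported value.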

Figures

Figures reproduced from arXiv: 2605.20183 by Difan Zou, Hongming Shan, Junjie Zhou, Junqiu Yu, Kaixun Jiang, Kai Zhu, Lingyi Hong, Quanhao Li, Ruihang Chu, Shiwei Zhang, Xiang Wang, Xihui Liu, Yang Shi, Yefei He, Yingya Zhang, Yongming Li, Yujie Wei, Yujin Han, Yu Liu, Zhekai Chen, Zhen Xing, Zhihang Liu, Zhiwu Qing.

Figure 1: Overview of MSAVBench. Left: the benchmark spans four data dimensions, namely video, audio, shot, and reference, covering diverse prompts, shot counts, and realistic and non-realistic scenarios. Right: the evaluation suite assesses generated MSAV content at four levels, including global, cross-shot, intra-shot, and reference levels, using a hybrid strategy that combines specialized expert models, rubric-ba…

Figure 2: Diverse distribution of MSAVBench. The benchmark covers diverse generation categories

Figure 3: Overview of the MSAVBench evaluation framework. We first perform agentic preprocessing with iterative shot self-correction to improve boundary quality. Metrics are then evaluated with stratified scoring paradigms, including expert models for well-defined tasks, rubric-based VLM scoring for subjective dimensions, and tool-grounded agentic scoring for complex properties.

Figure 4: Qualitative failure cases of evaluated models. Examples include text rendering errors (A), counterfactual subject mismatches (B), audio-visual synchronization failures (C), layout control failures (D), and incorrect subject counts (E).
Finding 2: Compared to basic audio-visual fidelity, open-source models lag significantly behind closed systems in “director-level” structural control and cinematic language.…

Figure 5: The data construction pipeline of MSAVBench. (1) Domain experts define an eight-category seed taxonomy with fine-grained sub-categories, with diverse types of subject, scene, and visual style. (2) GPT-5.4 first samples (theme, subject, scene, style) quadruples and synthesises an initial multi-shot script with structured per-shot metadata; a Prompt-Enhancement (PE) model then rewrites it into the global-to-…

Figure 6: Long-tail cinematic-language and tonal distributions of MSAVBench. Shot scale, camera angle, transition type and tone×saturation distributions on the released 286-prompt suite.
Transitions span 4 major types (hard cut 66.9%, dissolve 18.7%, none 13.0%, match cut / fade 1.4%); lighting is reported with 5 major types (natural, side, soft, neon, low-key).

Figure 7: Screenshot of the annotation interface used for pairwise expert evaluation.
Original abstract

Video generation is rapidly evolving from single-shot synthesis to complex multi-shot audio-video (MSAV) narratives to meet real-world demands. However, evaluating such frontier models remains a fundamental challenge. Existing benchmarks are limited in scope and data diversity, and rely on rigid evaluation pipelines, preventing systematic and reliable assessment of modern MSAV models. To bridge these gaps, we introduce MSAVBench, the first comprehensive benchmark and adaptive hybrid evaluation framework for multi-shot audio-video generation. Our benchmark spans four key dimensions, video, audio, shot, and reference, covering diverse task settings, varying shot counts of up to 15, and challenging non-realistic scenarios. Our evaluation framework improves robustness through an adaptive self-correction mechanism for shot segmentation, instance-wise rubrics for subjective metrics, and tool-grounded evidence extraction for complex judgments. Furthermore, MSAVBench achieves high alignment with human judgments, reaching a Spearman rank correlation of 91.5%. Our systematic evaluation of 19 state-of-the-art closed- and open-source models shows that current systems still struggle with director-level control and fine-grained audio-visual synchronization, while modular or agentic generation pipelines offer a promising path toward narrowing the gap between open- and closed-source models. We will release the benchmark data and evaluation code to facilitate future research.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces MSAVBench as the first comprehensive benchmark and adaptive hybrid evaluation framework for multi-shot audio-video (MSAV) generation. It spans four dimensions (video, audio, shot, reference), supports up to 15 shots and non-realistic scenarios, incorporates adaptive self-correction for shot segmentation, instance-wise rubrics, and tool-grounded extraction, reports a 91.5% Spearman rank correlation with human judgments, and evaluates 19 closed- and open-source models to highlight gaps in director-level control and audio-visual synchronization while noting promise in modular pipelines. The benchmark data and code will be released.

Significance. If the 91.5% correlation is shown to be independently validated without circularity in rubric design or human data collection, MSAVBench would provide a much-needed standardized and reliable tool for evaluating complex multi-shot generation models, addressing limitations in scope and rigidity of prior benchmarks. The systematic evaluation of 19 models offers concrete insights into current model weaknesses, and the explicit commitment to releasing benchmark data and evaluation code is a clear strength that supports reproducibility and community progress in this emerging area.

major comments (2)
  1. Abstract: The central reliability claim rests on achieving a Spearman rank correlation of 91.5% with human judgments. The manuscript provides no protocol details on human evaluation (annotator count, inter-rater reliability, blinding procedures, or confirmation that the four dimensions, adaptive rules, and instance-wise rubrics were not iteratively refined against the same human ratings used for the correlation). This directly bears on whether the metric demonstrates independent validity or internal consistency, as raised by the stress-test concern.
  2. Abstract (evaluation framework description): The adaptive self-correction for segmentation, instance-wise rubrics for subjective metrics, and tool-grounded evidence extraction are presented as key robustness improvements. However, without ablation results, concrete examples of failure-mode handling (e.g., for 15-shot non-realistic sequences), or evidence that these components avoid introducing new biases, it remains unclear whether the framework provides a complete and unbiased measure across the claimed diverse task settings.
minor comments (2)
  1. The abstract mentions 'challenging non-realistic scenarios' and 'director-level control' but does not define these terms or provide illustrative examples; adding a short definition or example in the introduction would improve clarity.
  2. The planned release of benchmark data and evaluation code is noted positively; including a brief statement on licensing or access method would further strengthen the reproducibility claim.
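Major comment 1 asks for inter-rater reliability. The manuscript, as summarized here, does not say which agreement statistic would be used, so the following is only a sketch of one common choice for two annotators making pairwise choices: Cohen's kappa, which corrects raw agreement for the agreement expected by chance. The function name and the label sequences are hypothetical.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty, equal-length label sequences")
    n = len(labels_a)
    # Observed agreement: fraction of items given the same label by both.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each annotator's marginal label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        raise ValueError("kappa is undefined when both annotators use one identical label")
    return (observed - expected) / (1 - expected)


# Hypothetical pairwise preferences ("A" or "B" wins) from two annotators.
annotator_1 = ["A", "A", "B", "B", "A", "B", "A", "A"]
annotator_2 = ["A", "B", "B", "B", "A", "B", "A", "B"]
kappa = cohens_kappa(annotator_1, annotator_2)
```

Here the annotators agree on 6 of 8 items, but the chance-corrected value is noticeably lower than 0.75, which is why a raw agreement rate alone would not answer the referee's concern. With more than two annotators, a multi-rater statistic would be needed instead.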

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback and for recognizing the potential value of MSAVBench. We address each major comment below and commit to revisions that improve transparency and evidence for our claims.

point-by-point responses
  1. Referee: [—] Abstract: The central reliability claim rests on achieving a Spearman rank correlation of 91.5% with human judgments. The manuscript provides no protocol details on human evaluation (annotator count, inter-rater reliability, blinding procedures, or confirmation that the four dimensions, adaptive rules, and instance-wise rubrics were not iteratively refined against the same human ratings used for the correlation). This directly bears on whether the metric demonstrates independent validity or internal consistency, as raised by the stress-test concern.

    Authors: We agree that the current manuscript lacks sufficient protocol details to fully substantiate independent validity and address potential circularity. In the revised version we will add a dedicated subsection on human evaluation methodology that specifies annotator count, inter-rater reliability, blinding procedures, and an explicit statement that rubric design and adaptive rules were finalized prior to and independently of the human ratings used for the reported Spearman correlation. We will also include a brief summary of these elements in the abstract. revision: yes

  2. Referee: [—] Abstract (evaluation framework description): The adaptive self-correction for segmentation, instance-wise rubrics for subjective metrics, and tool-grounded evidence extraction are presented as key robustness improvements. However, without ablation results, concrete examples of failure-mode handling (e.g., for 15-shot non-realistic sequences), or evidence that these components avoid introducing new biases, it remains unclear whether the framework provides a complete and unbiased measure across the claimed diverse task settings.

    Authors: We concur that ablation studies and concrete examples would strengthen the robustness claims. The revised manuscript will incorporate ablation experiments isolating the adaptive self-correction and instance-wise rubrics, quantitative results on their effect on human correlation, and qualitative examples of failure-mode handling for 15-shot non-realistic sequences. We will also add analysis showing how tool-grounded extraction reduces subjectivity relative to purely LLM-based scoring. revision: yes

Circularity Check

0 steps flagged

No significant circularity in MSAVBench benchmark and human alignment claim

full rationale

The paper introduces MSAVBench as a new benchmark spanning four dimensions with an adaptive hybrid evaluation framework featuring self-correction segmentation, instance-wise rubrics, and tool-grounded extraction. The 91.5% Spearman correlation is reported as alignment with independent human judgments rather than any internally fitted or self-defined metric. No equations, parameters, or derivations reduce the framework's outputs or validity to its own construction by definition. The central claims rest on external human ratings and systematic model evaluations, which are presented as falsifiable and independent of the benchmark's internal rules. This is a standard benchmark paper with external validation and no load-bearing self-citation chains or fitted predictions that collapse to inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the premise that human preference ratings constitute a reliable external ground truth and that the four chosen dimensions plus adaptive mechanisms adequately represent MSAV quality; no new physical entities or fitted constants are introduced.

axioms (1)
  • domain assumption Human judgments serve as the authoritative reference for validating automated evaluation metrics in generative media tasks.
    The 91.5% Spearman correlation is presented as evidence of reliability, implying human ratings are the target the benchmark aims to match.

Reference graph

Works this paper leans on

80 extracted references · 80 canonical work pages · 15 internal anchors

  1. [1]

    PaddleOCR 3.0 Technical Report, author=Cheng Cui and Ting Sun and Manhui Lin and Tingquan Gao and Yubo Zhang and Jiaxuan Liu and Xueqing Wang and Zelun Zhang and Changda Zhou and Hongen Liu and Yue Zhang and Wenyu Lv and Kui Huang and Yichao Zhang and Jing Zhang and Jun Zhang and Yi Liu and Dianhai Yu and Yanjun Ma, 2025

  2. [2]

    Qwen3-VL Technical Report

    Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, Wenbin Ge, Zhifang Guo, Qidong Huang, Jie Huang, Fei Huang, Binyuan Hui, Shutong Jiang, Zhaohai Li, Mingsheng Li, Mei Li, Kaixin Li, Zicheng Lin, Junyang Lin, Xuejing Liu, Jiawei Liu, Chenglong Liu, Yang Liu, Dayiheng Liu, Shixuan ...

  3. [3]

    BlazePose: On-device Real-time Body Pose Tracking

    Valentin Bazarevsky, Ivan Grishchenko, Karthik Raveendran, Tyler Zhu, Fan Zhang, and Matthias Grundmann. Blazepose: On-device real-time body pose tracking.arXiv preprint arXiv:2006.10204, 2020

  4. [4]

    Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets

    Andreas Blattmann, Tim Dockhorn, Sumith Kulal, Daniel Mendelevitch, Maciej Kilian, Do- minik Lorenz, Yam Levi, Zion English, Vikram V oleti, Adam Letts, et al. Stable video diffusion: Scaling latent video diffusion models to large datasets.arXiv preprint arXiv:2311.15127, 2023

  5. [5]

    Video generation models as world simulators

    Tim Brooks, Bill Peebles, Connor Holmes, Will DePue, Yufei Guo, Leo Jing, David Schnurr, Joe Taylor, Troy Luhman, Eric Luhman, et al. Video generation models as world simulators. OpenAI Blog, 1(8):1, 2024

  6. [6]

    Kevin Cai, Chonghua Liu, and David M. Chan. Anim-400k: A large-scale dataset for automated end to end dubbing of video. InICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 1–5. IEEE, 2024

  7. [7]

    T2av-compass: Towards unified evaluation for text-to-audio-video generation.arXiv preprint arXiv:2512.21094, 2025

    Zhe Cao, Tao Wang, Jiaming Wang, Yanghai Wang, Yuanxing Zhang, Jialu Chen, Miao Deng, Jiahao Wang, Yubin Guo, Chenxi Liao, et al. T2av-compass: Towards unified evaluation for text-to-audio-video generation.arXiv preprint arXiv:2512.21094, 2025

  8. [8]

    Talkvid: A large-scale diversified dataset for audio-driven talking head synthesis, 2025

    Shunian Chen, Hejin Huang, Yexin Liu, Zihan Ye, Pengcheng Chen, Chenghao Zhu, Michael Guan, Rongsheng Wang, Junying Chen, Guanbin Li, Ser-Nam Lim, Harry Yang, and Benyou Wang. Talkvid: A large-scale diversified dataset for audio-driven talking head synthesis, 2025

  9. [9]

    MMAudio: Taming Multimodal Joint Training for High-Quality Video-to-Audio Synthesis, 2025

    Ho Kei Cheng, Masato Ishii, Akio Hayakawa, Takashi Shibuya, Alexander Schwing, and Yuki Mitsufuji. MMAudio: Taming Multimodal Joint Training for High-Quality Video-to-Audio Synthesis, 2025. 10

  10. [10]

    W2v-bert: Combining contrastive learning and masked language modeling for self-supervised speech pre-training

    Yu-An Chung, Yu Zhang, Wei Han, Chung-Cheng Chiu, James Qin, Ruoming Pang, and Yonghui Wu. W2v-bert: Combining contrastive learning and masked language modeling for self-supervised speech pre-training. In2021 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), pages 244–250. IEEE, 2021

  11. [11]

    Paddleocr-vl-1.5: Towards a multi-task 0.9b vlm for robust in-the-wild document parsing, 2026

    Cheng Cui, Ting Sun, Suyin Liang, Tingquan Gao, Zelun Zhang, Jiaxuan Liu, Xueqing Wang, Changda Zhou, Hongen Liu, Manhui Lin, Yue Zhang, Yubo Zhang, Yi Liu, Dianhai Yu, and Yanjun Ma. Paddleocr-vl-1.5: Towards a multi-task 0.9b vlm for robust in-the-wild document parsing, 2026

  12. [12]

    PaddleOCR-VL: Boosting Multilingual Document Parsing via a 0.9B Ultra-Compact Vision-Language Model, 2025

    Cheng Cui, Ting Sun, Suyin Liang, Tingquan Gao, Zelun Zhang, Jiaxuan Liu, Xueqing Wang, Changda Zhou, Hongen Liu, Manhui Lin, Yue Zhang, Yubo Zhang, Handong Zheng, Jing Zhang, Jun Zhang, Yi Liu, Dianhai Yu, and Yanjun Ma. PaddleOCR-VL: Boosting Multilingual Document Parsing via a 0.9B Ultra-Compact Vision-Language Model, 2025

  13. [13]

    Demucs: Deep extractor for music sources with extra unlabeled data remixed.arXiv preprint arXiv:1909.01174, 2019

    Alexandre Défossez, Nicolas Usunier, Léon Bottou, and Francis Bach. Demucs: Deep extractor for music sources with extra unlabeled data remixed.arXiv preprint arXiv:1909.01174, 2019

  14. [14]

    Arcface: Additive angular margin loss for deep face recognition

    Jiankang Deng, Jia Guo, Niannan Xue, and Stefanos Zafeiriou. Arcface: Additive angular margin loss for deep face recognition. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 4690–4699, 2019

  15. [15]

    Scaling rectified flow trans- formers for high-resolution image synthesis

    Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, et al. Scaling rectified flow trans- formers for high-resolution image synthesis. InForty-first international conference on machine learning, 2024

  16. [16]

    Veo 3.1.https://deepmind.google/technologies/veo/, 2026

    Google DeepMind. Veo 3.1.https://deepmind.google/technologies/veo/, 2026

  17. [17]

    Audcast: Audio-driven human video generation by cascaded diffusion transformers

    Jiazhi Guan, Kaisiyuan Wang, Zhiliang Xu, Quanwei Yang, Yasheng Sun, Shengyi He, Borong Liang, Yukang Cao, Yingying Li, Haocheng Feng, et al. Audcast: Audio-driven human video generation by cascaded diffusion transformers. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10678–10689, 2025

  18. [18]

    Dreamid-omni: Unified framework for controllable human-centric audio-video generation.arXiv preprint arXiv:2602.12160, 2026

    Xu Guo, Fulong Ye, Qichao Sun, Liyang Chen, Bingchuan Li, Pengze Zhang, Jiawei Liu, Song- tao Zhao, Qian He, and Xiangwang Hou. Dreamid-omni: Unified framework for controllable human-centric audio-video generation.arXiv preprint arXiv:2602.12160, 2026

  19. [19]

    LTX-Video: Realtime Video Latent Diffusion

    Yoav HaCohen, Nisan Chiprut, Benny Brazowski, Daniel Shalem, Dudu Moshe, Eitan Richard- son, Eran Levin, Guy Shiran, Nir Zabari, Ori Gordon, Poriya Panet, Sapir Weissbuch, Victor Kulikov, Yaki Bitterman, Zeev Melumian, and Ofir Bibi. Ltx-video: Realtime video latent diffusion.arXiv preprint arXiv:2501.00103, 2024

  20. [20]

    Video-bench: Human-aligned video generation benchmark

    Hui Han, Siyuan Li, Jiaqi Chen, Yiwen Yuan, Yuling Wu, Yufan Deng, Chak Tou Leong, Hanwen Du, Junchen Fu, Youhua Li, et al. Video-bench: Human-aligned video generation benchmark. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 18858–18868, 2025

  21. [21]

    AesRM: Improving Video Aesthetics with Expert-Level Feedback

    Yujin Han, Yujie Wei, Yefei He, Xinyu Liu, Tianle Li, Zichao Yu, Andi Han, Shiwei Zhang, Tingyu Weng, and Difan Zou. Aesrm: Improving video aesthetics with expert-level feedback. arXiv preprint arXiv:2604.28078, 2026

  22. [22]

    HappyHorse.https://happyhorse.app/, 2026

    HappyHorse AI. HappyHorse.https://happyhorse.app/, 2026

  23. [23]

    Denoising diffusion probabilistic models.Advances in neural information processing systems, 33:6840–6851, 2020

    Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models.Advances in neural information processing systems, 33:6840–6851, 2020

  24. [24]

    Video diffusion models.Advances in neural information processing systems, 35:8633–8646, 2022

    Jonathan Ho, Tim Salimans, Alexey Gritsenko, William Chan, Mohammad Norouzi, and David J Fleet. Video diffusion models.Advances in neural information processing systems, 35:8633–8646, 2022

  25. [25]

    VABench: A Comprehensive Benchmark for Audio-Video Generation

    Daili Hua, Xizhi Wang, Bohan Zeng, Xinyi Huang, Hao Liang, Junbo Niu, Xinlong Chen, Quanqing Xu, and Wentao Zhang. Vabench: A comprehensive benchmark for audio-video generation.arXiv preprint arXiv:2512.09299, 2025. 11

  26. [26]

    Self forcing: Bridging the train-test gap in autoregressive video diffusion, 2025

    Xun Huang, Zhengqi Li, Guande He, Mingyuan Zhou, and Eli Shechtman. Self forcing: Bridging the train-test gap in autoregressive video diffusion, 2025

  27. [27]

    Vbench: Comprehensive benchmark suite for video generative models

    Ziqi Huang, Yinan He, Jiashuo Yu, Fan Zhang, Chenyang Si, Yuming Jiang, Yuanhan Zhang, Tianxing Wu, Qingyang Jin, Nattapol Chanpaisit, et al. Vbench: Comprehensive benchmark suite for video generative models. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 21807–21818, 2024

  28. [28]

    Synchformer: Efficient synchronization from sparse cues

    Vladimir Iashin, Weidi Xie, Esa Rahtu, and Andrew Zisserman. Synchformer: Efficient synchronization from sparse cues. InICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 5325–5329. IEEE, 2024

  29. [29]

    All-in-one metrical and functional structure analysis with neigh- borhood attentions on demixed audio

    Taejun Kim and Juhan Nam. All-in-one metrical and functional structure analysis with neigh- borhood attentions on demixed audio. InIEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), 2023

  30. [30]

    HunyuanVideo: A Systematic Framework For Large Video Generative Models

    Weijie Kong, Qi Tian, Zijian Zhang, Rox Min, Zuozhuo Dai, Jin Zhou, Jiangfeng Xiong, Xin Li, Bo Wu, Jianwei Zhang, et al. Hunyuanvideo: A systematic framework for large video generative models.arXiv preprint arXiv:2412.03603, 2024

  31. [31]

    Kling 3.0.https://klingai.com/global/, 2026

    Kuaishou Technology. Kling 3.0.https://klingai.com/global/, 2026

  32. [32]

    Latentsync: Taming audio-conditioned latent diffusion models for lip sync with syncnet supervision.arXiv preprint arXiv:2412.09262, 2024

    Chunyu Li, Chao Zhang, Weikai Xu, Jingyu Lin, Jinghui Xie, Weiguo Feng, Bingyue Peng, Cunjian Chen, and Weiwei Xing. Latentsync: Taming audio-conditioned latent diffusion models for lip sync with syncnet supervision.arXiv preprint arXiv:2412.09262, 2024

  33. [33]

    Lr-asd: Lightweight and robust network for active speaker detection.International Journal of Computer Vision, 133(7):4749–4769, 2025

    Junhua Liao, Haihan Duan, Kanghui Feng, Wanbing Zhao, Yanbing Yang, Liangyin Chen, and Yanru Chen. Lr-asd: Lightweight and robust network for active speaker detection.International Journal of Computer Vision, 133(7):4749–4769, 2025

  34. [34]

    Aibench: Evaluating visual-logical consistency in academic illustration generation.arXiv preprint arXiv:2603.28068, 2026

    Zhaohe Liao, Kaixun Jiang, Zhihang Liu, Yujie Wei, Junqiu Yu, Quanhao Li, Hong-Tao Yu, Pandeng Li, Yuzheng Wang, Zhen Xing, et al. Aibench: Evaluating visual-logical consistency in academic illustration generation.arXiv preprint arXiv:2603.28068, 2026

  35. [35]

    Javisgpt: A unified multi-modal llm for sounding-video comprehension and generation

    Kai Liu, Jungang Li, Yuchong Sun, Shengqiong Wu, Jianzhang Gao, Daoan Zhang, Wei Zhang, Sheng Jin, Sicheng Yu, Geng Zhan, Jiayi Ji, Fan Zhou, Liang Zheng, Shuicheng Y AN, Hao Fei, and Tat-Seng Chua. Javisgpt: A unified multi-modal llm for sounding-video comprehension and generation. InThe Thirty-ninth Annual Conference on Neural Information Processing Sys...

  36. [36]

    Javisdit++: Unified modeling and optimization for joint audio-video generation

    Kai Liu, Yanhao Zheng, Kai Wang, Shengqiong Wu, Rongjunchen Zhang, Jiebo Luo, Dimitrios Hatzinakos, Ziwei Liu, Hao Fei, and Tat-Seng Chua. Javisdit++: Unified modeling and optimization for joint audio-video generation. InThe Fourteenth International Conference on Learning Representations, 2026

  37. [37]

    Grounding dino: Marrying dino with grounded pre-training for open-set object detection

    Shilong Liu, Zhaoyang Zeng, Tianhe Ren, Feng Li, Hao Zhang, Jie Yang, Qing Jiang, Chunyuan Li, Jianwei Yang, Hang Su, et al. Grounding dino: Marrying dino with grounded pre-training for open-set object detection. InEuropean conference on computer vision, pages 38–55. Springer, 2024

  38. [38]

    Evalcrafter: Benchmarking and evaluating large video generation models.arXiv preprint arXiv:2310.11440, 2023

    Yaofang Liu, Xiaodong Cun, Xuebo Liu, Xintao Wang, Yong Zhang, Haoxin Chen, Yang Liu, Tieyong Zeng, Raymond Chan, and Ying Shan. Evalcrafter: Benchmarking and evaluating large video generation models.arXiv preprint arXiv:2310.11440, 2023

  39. [39]

    Shotstream: Streaming multi-shot video generation for interactive storytelling

    Yawen Luo, Xiaoyu Shi, Junhao Zhuang, Yutian Chen, Quande Liu, Xintao Wang, Pengfei Wan, and Tianfan Xue. Shotstream: Streaming multi-shot video generation for interactive storytelling. arXiv preprint arXiv:2603.25746, 2026

  40. [40]

    Wan-Image: Pushing the Boundaries of Generative Visual Intelligence

    Chaojie Mao, Chen-Wei Xie, Chongyang Zhong, Haoyou Deng, Jiaxing Zhao, Jie Xiao, Jinbo Xing, Jingfeng Zhang, Jingren Zhou, Jingyi Zhang, et al. Wan-image: Pushing the boundaries of generative visual intelligence.arXiv preprint arXiv:2604.19858, 2026

  41. [41]

    OenAI. GPT-5.4. https://openai.com/zh-Hans-CN/index/introducing-gpt-5-4/ , 2026. 12

  42. [42]

    Sora 2.https://openai.com/index/sora-2/, 2025

    OpenAI. Sora 2.https://openai.com/index/sora-2/, 2025

  43. [43]

    DINOv2: Learning Robust Visual Features without Supervision

    Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy V o, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. Dinov2: Learning robust visual features without supervision.arXiv preprint arXiv:2304.07193, 2023

  44. [44]

    Sortformer: A novel approach for permutation-resolved speaker supervision in speech-to-text systems.arXiv preprint arXiv:2409.06656, 2024

    Taejin Park, Ivan Medennikov, Kunal Dhawan, Weiqing Wang, He Huang, Nithin Rao Koluguri, Krishna C Puvvada, Jagadeesh Balam, and Boris Ginsburg. Sortformer: A novel approach for permutation-resolved speaker supervision in speech-to-text systems.arXiv preprint arXiv:2409.06656, 2024

  45. [45]

    Movie Gen: A Cast of Media Foundation Models

    Adam Polyak, Amit Zohar, Andrew Brown, Andros Tjandra, Animesh Sinha, Ann Lee, Apoorv Vyas, Bowen Shi, Chih-Yao Ma, Ching-Yao Chuang, et al. Movie gen: A cast of media foundation models.arXiv preprint arXiv:2410.13720, 2024

  46. [46]

    Learning transferable visual models from natural language supervision

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. InInternational conference on machine learning, pages 8748–8763. PmLR, 2021

  47. [47]

    Robust speech recognition via large-scale weak supervision

    Alec Radford, Jong Wook Kim, Tao Xu, Greg Brockman, Christine McLeavey, and Ilya Sutskever. Robust speech recognition via large-scale weak supervision. InInternational conference on machine learning, pages 28492–28518. PMLR, 2023

  48. [48]

    Seedance 2.0: Advancing Video Generation for World Complexity

    Team Seedance, De Chen, Liyang Chen, Xin Chen, Ying Chen, Zhuo Chen, Zhuowei Chen, Feng Cheng, Tianheng Cheng, Yufeng Cheng, et al. Seedance 2.0: Advancing video generation for world complexity. arXiv preprint arXiv:2604.14148, 2026

  49. [49]

    Hunyuanvideo-foley: Multimodal diffusion with representation alignment for high-fidelity foley audio generation

    Sizhe Shan, Qiulin Li, Yutao Cui, Miles Yang, Yuehai Wang, Qun Yang, Jin Zhou, and Zhao Zhong. Hunyuanvideo-foley: Multimodal diffusion with representation alignment for high-fidelity foley audio generation, 2025

  50. [50]

    Msvbench: Towards human-level evaluation of multi-shot video generation

    Haoyuan Shi, Yunxin Li, Nanhao Deng, Zhenran Xu, Xinyu Chen, Longyue Wang, Baotian Hu, and Min Zhang. Msvbench: Towards human-level evaluation of multi-shot video generation. arXiv preprint arXiv:2602.23969, 2026

  51. [51]

    SII-GAIR, Sand. ai, Ethan Chern, Hansi Teng, Hanwen Sun, Hao Wang, Hong Pan, Hongyu Jia, Jiadi Su, Jin Li, Junjie Yu, Lijie Liu, Lingzhi Li, Lyumanshan Ye, Min Hu, Qiangang Wang, Quanwei Qi, Steffi Chern, Tao Bu, Taoran Wang, Teren Xu, Tianning Zhang, Tiantian Mi, Weixian Xu, Wenqiang Zhang, Wentai Zhang, Xianping Yi, Xiaojie Cai, Xiaoyang Kang, Yan Ma, Y...

  52. [52]

    Make-A-Video: Text-to-Video Generation without Text-Video Data

    Uriel Singer, Adam Polyak, Thomas Hayes, Xi Yin, Jie An, Songyang Zhang, Qiyuan Hu, Harry Yang, Oron Ashual, Oran Gafni, et al. Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:2209.14792, 2022

  53. [53]

    Measuring style similarity in diffusion models

    Gowthami Somepalli, Anubhav Gupta, Kamal Gupta, Shramay Palta, Micah Goldblum, Jonas Geiping, Abhinav Shrivastava, and Tom Goldstein. Measuring style similarity in diffusion models. arXiv preprint arXiv:2404.01292, 2024

  54. [54]

    Transnet v2: An effective deep network architecture for fast shot transition detection

    Tomáš Souček and Jakub Lokoč. Transnet v2: An effective deep network architecture for fast shot transition detection. arXiv preprint arXiv:2008.04838, 2020

  55. [55]

    The proof and measurement of association between two things

    Charles Spearman. The proof and measurement of association between two things. 1961

  56. [56]

    Mova: Towards scalable and synchronized video-audio generation

    OpenMOSS Team, Donghua Yu, Mingshu Chen, Qi Chen, Qi Luo, Qianyi Wu, Qinyuan Cheng, Ruixiao Li, Tianyi Liang, Wenbo Zhang, et al. Mova: Towards scalable and synchronized video-audio generation. arXiv preprint arXiv:2602.08794, 2026

  57. [57]

    Qwen3.5: Accelerating productivity with native multimodal agents

    Qwen Team. Qwen3.5: Accelerating productivity with native multimodal agents, February 2026

  58. [58]

    Silero VAD: pre-trained enterprise-grade Voice Activity Detector (VAD), Number Detector and Language Classifier

    Silero Team. Silero VAD: pre-trained enterprise-grade Voice Activity Detector (VAD), Number Detector and Language Classifier. https://github.com/snakers4/silero-vad, 2024

  59. [59]

    Gemini 3.1 Pro

    The Gemini Team. Gemini 3.1 Pro. https://blog.google/innovation-and-ai/models-and-research/gemini-models/gemini-3-1-pro/, 2026

  60. [60]

    Meta audiobox aesthetics: Unified automatic quality assessment for speech, music, and sound

    Andros Tjandra, Yi-Chiao Wu, Baishan Guo, John Hoffman, Brian Ellis, Apoorv Vyas, Bowen Shi, Sanyuan Chen, Matt Le, Nick Zacharov, Carleigh Wood, Ann Lee, and Wei-Ning Hsu. Meta audiobox aesthetics: Unified automatic quality assessment for speech, music, and sound. 2025

  61. [61]

    Wan2.7

    Tongyi Wanxiang Team. Wan2.7. https://www.wan27.xyz/, 2026

  62. [62]

    Wan: Open and Advanced Large-Scale Video Generative Models

    Team Wan, Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, Chen-Wei Xie, Di Chen, Feiwu Yu, Haiming Zhao, Jianxiao Yang, et al. Wan: Open and advanced large-scale video generative models. arXiv preprint arXiv:2503.20314, 2025

  63. [63]

    Av-dit: Efficient audio-visual diffusion transformer for joint audio and video generation

    Kai Wang, Shijian Deng, Jing Shi, Dimitrios Hatzinakos, and Yapeng Tian. Av-dit: Efficient audio-visual diffusion transformer for joint audio and video generation. arXiv preprint arXiv:2406.07686, 2024

  64. [64]

    Japanese Anime Scenes

    Wei Wang. Japanese Anime Scenes. https://www.kaggle.com/datasets/weiwangk/japanese-anime-scenes, 2023

  65. [65]

    Univbench: Towards unified evaluation for video foundation models

    Jianhui Wei, Xiaotian Zhang, Yichen Li, Yuan Wang, Yan Zhang, Ziyi Chen, Zhihang Tang, Wei Xu, and Zuozhu Liu. Univbench: Towards unified evaluation for video foundation models. arXiv preprint arXiv:2602.21835, 2026

  66. [66]

    Dreamvideo-omni: Omni-motion controlled multi-subject video customization with latent identity reinforcement learning

    Yujie Wei, Xinyu Liu, Shiwei Zhang, Hangjie Yuan, Jinbo Xing, Zhekai Chen, Xiang Wang, Haonan Qiu, Rui Zhao, Yutong Feng, et al. Dreamvideo-omni: Omni-motion controlled multi-subject video customization with latent identity reinforcement learning. arXiv preprint arXiv:2603.12257, 2026

  67. [67]

    Dreamvideo: Composing your dream videos with customized subject and motion

    Yujie Wei, Shiwei Zhang, Zhiwu Qing, Hangjie Yuan, Zhiheng Liu, Yu Liu, Yingya Zhang, Jingren Zhou, and Hongming Shan. Dreamvideo: Composing your dream videos with customized subject and motion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6537–6549, 2024

  68. [68]

    Dreamrelation: Relation-centric video customization

    Yujie Wei, Shiwei Zhang, Hangjie Yuan, Biao Gong, Longxiang Tang, Xiang Wang, Haonan Qiu, Hengjia Li, Shuai Tan, Yingya Zhang, et al. Dreamrelation: Relation-centric video customization. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 12381–12393, 2025

  69. [69]

    Routing matters in moe: Scaling diffusion transformers with explicit routing guidance

    Yujie Wei, Shiwei Zhang, Hangjie Yuan, Yujin Han, Zhekai Chen, Jiayu Wang, Difan Zou, Xihui Liu, Yingya Zhang, Yu Liu, et al. Routing matters in moe: Scaling diffusion transformers with explicit routing guidance. arXiv preprint arXiv:2510.24711, 2025

  70. [70]

    Dreamvideo-2: Zero-shot subject-driven video customization with precise motion control

    Yujie Wei, Shiwei Zhang, Hangjie Yuan, Xiang Wang, Haonan Qiu, Rui Zhao, Yutong Feng, Feng Liu, Zhizhong Huang, Jiaxin Ye, et al. Dreamvideo-2: Zero-shot subject-driven video customization with precise motion control. arXiv preprint arXiv:2410.13830, 2024

  71. [71]

    PhyAVBench: A Challenging Audio Physics-Sensitivity Benchmark for Physically Grounded Text-to-Audio-Video Generation

    Tianxin Xie, Wentao Lei, Kai Jiang, Guanjie Huang, Pengfei Zhang, Chunhui Zhang, Fengji Ma, Haoyu He, Han Zhang, Jiangshan He, et al. Phyavbench: A challenging audio physics-sensitivity benchmark for physically grounded text-to-audio-video generation. arXiv preprint arXiv:2512.23994, 2025

  72. [72]

    Fireredasr2s: A state-of-the-art industrial-grade all-in-one automatic speech recognition system

    Kaituo Xu, Yan Jia, Kai Huang, Junjie Chen, Wenpeng Li, Kun Liu, Feng-Long Xie, Xu Tang, and Yao Hu. Fireredasr2s: A state-of-the-art industrial-grade all-in-one automatic speech recognition system. arXiv preprint arXiv:2603.10420, 2026

  73. [73]

    Longlive: Real-time interactive long video generation

    Shuai Yang, Wei Huang, Ruihang Chu, Yicheng Xiao, Yuyang Zhao, Xianbang Wang, Muyang Li, Enze Xie, Yingcong Chen, Yao Lu, Song Han, and Yukang Chen. Longlive: Real-time interactive long video generation. 2025

  74. [74]

    Opens2v-nexus: A detailed benchmark and million-scale dataset for subject-to-video generation

    Shenghai Yuan, Xianyi He, Yufan Deng, Yang Ye, Jinfa Huang, Bin Lin, Jiebo Luo, and Li Yuan. Opens2v-nexus: A detailed benchmark and million-scale dataset for subject-to-video generation. arXiv preprint arXiv:2505.20292, 2025

  75. [75]

    Helios: Real real-time long video generation model

    Shenghai Yuan, Yuanyang Yin, Zongjian Li, Xinwei Huang, Xiao Yang, and Li Yuan. Helios: Real real-time long video generation model. arXiv preprint arXiv:2603.04379, 2026

  76. [76]

    Uniform: A unified multi-task diffusion transformer for audio-video generation

    Lei Zhao, Linfeng Feng, Dongxu Ge, Rujin Chen, Fangqiu Yi, Chi Zhang, Xiao-Lei Zhang, and Xuelong Li. Uniform: A unified multi-task diffusion transformer for audio-video generation. arXiv preprint arXiv:2502.03897, 2025

  77. [77]

    MTAVG-Bench: A Diagnostic Benchmark for Multi-Talker Dialogue-Centric Audio-Video Generation

    Yang-Hao Zhou, Haitian Li, Rexar Lin, Heyan Huang, Jinxing Zhou, Changsen Yuan, Tian Lan, Ziqin Zhou, Yudong Li, Jiajun Xu, et al. Mtavg-bench: A comprehensive benchmark for evaluating multi-talker dialogue-centric audio-video generation. arXiv preprint arXiv:2602.00607, 2026

  78. [78]

    AVGen-Bench: A Task-Driven Benchmark for Multi-Granular Evaluation of Text-to-Audio-Video Generation

    Ziwei Zhou, Zeyuan Lai, Rui Wang, Yifan Yang, Zhen Xing, Yuqing Yang, Qi Dai, Lili Qiu, and Chong Luo. Avgen-bench: A task-driven benchmark for multi-granular evaluation of text-to-audio-video generation. arXiv preprint arXiv:2604.08540, 2026

  79. [79]

    Causal forcing: Autoregressive diffusion distillation done right for high-quality real-time interactive video generation

    Hongzhou Zhu, Min Zhao, Guande He, Hang Su, Chongxuan Li, and Jun Zhu. Causal forcing: Autoregressive diffusion distillation done right for high-quality real-time interactive video generation, 2026

  80. [80]

    Vistorybench: Comprehensive benchmark suite for story visualization

    Cailin Zhuang, Ailin Huang, Yaoqi Hu, Jingwei Wu, Wei Cheng, Jiaqi Liao, Hongyuan Wang, Xinyao Liao, Weiwei Cai, Hengyuan Xu, et al. Vistorybench: Comprehensive benchmark suite for story visualization. arXiv preprint arXiv:2505.24862, 2025

Appendix

A More Data Details on MSAVBench

A.1 Data Design Details ...