MSAVBench: Towards Comprehensive and Reliable Evaluation of Multi-Shot Audio-Video Generation
Pith reviewed 2026-05-20 05:12 UTC · model grok-4.3
The pith
MSAVBench is introduced as a benchmark and framework for evaluating multi-shot audio-video generation models.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The authors present MSAVBench as the first comprehensive benchmark for multi-shot audio-video generation that includes diverse task settings, varying shot counts up to 15, and non-realistic scenarios across video, audio, shot, and reference dimensions. The associated adaptive hybrid evaluation framework uses self-correction for segmentation, instance-wise rubrics for subjective metrics, and tool-grounded evidence extraction to improve robustness and achieve strong alignment with human judgments. Systematic testing of 19 models reveals ongoing challenges in director-level control and fine-grained audio-visual synchronization, suggesting that modular or agentic pipelines may help reduce the performance gap between open- and closed-source models.
What carries the argument
MSAVBench, the benchmark dataset and adaptive hybrid evaluation framework that combines automatic tools with human-aligned metrics across multiple quality dimensions.
If this is right
- Models can be systematically compared on their ability to handle complex multi-shot narratives.
- Current generation systems need improvement in maintaining consistency across shots and synchronizing audio with visuals.
- Agentic or modular approaches appear effective for enhancing open-source model performance.
- Future research will benefit from released data and code for developing better evaluation methods.
Where Pith is reading between the lines
- This evaluation approach might be adapted to assess other types of generative media like text-to-video with longer sequences.
- Researchers could use the benchmark to test new models specifically for narrative coherence in audio-visual outputs.
- It may highlight the need for integrated training methods that jointly optimize video and audio components rather than separate modules.
Load-bearing premise
The four chosen dimensions along with the adaptive mechanisms and rubrics fully capture the quality of multi-shot audio-video generation without missing important failure cases or creating evaluation biases.
What would settle it
A large-scale human preference study on generated multi-shot videos where the benchmark rankings do not match the human rankings would indicate the evaluation is not reliable.
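The comparison described above can be sketched as a rank correlation between benchmark scores and human scores for the same set of models. The snippet below is a minimal pure-Python illustration of Spearman's rho (Pearson correlation of average ranks), not the authors' evaluation code; the five score pairs are invented for illustration and are not results from the paper.

```python
from statistics import mean

def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks

def spearman(xs, ys):
    """Spearman's rho: the Pearson correlation of the two rank vectors."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("rank correlation is undefined for constant input")
    return cov / (var_x * var_y) ** 0.5

# Invented scores for five hypothetical models (NOT results from the paper).
benchmark_scores = [0.82, 0.75, 0.64, 0.58, 0.41]
human_scores = [4.5, 4.1, 3.2, 3.6, 2.0]
rho = spearman(benchmark_scores, human_scores)
```

In this toy case the benchmark and the human raters disagree only on the order of two adjacent models. A value of rho well below the reported 91.5% on a larger, independently collected study would be the kind of evidence that undermines the reliability claim.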
Original abstract
Video generation is rapidly evolving from single-shot synthesis to complex multi-shot audio-video (MSAV) narratives to meet real-world demands. However, evaluating such frontier models remains a fundamental challenge. Existing benchmarks are limited in scope and data diversity, and rely on rigid evaluation pipelines, preventing systematic and reliable assessment of modern MSAV models. To bridge these gaps, we introduce MSAVBench, the first comprehensive benchmark and adaptive hybrid evaluation framework for multi-shot audio-video generation. Our benchmark spans four key dimensions, video, audio, shot, and reference, covering diverse task settings, varying shot counts of up to 15, and challenging non-realistic scenarios. Our evaluation framework improves robustness through an adaptive self-correction mechanism for shot segmentation, instance-wise rubrics for subjective metrics, and tool-grounded evidence extraction for complex judgments. Furthermore, MSAVBench achieves high alignment with human judgments, reaching a Spearman rank correlation of 91.5%. Our systematic evaluation of 19 state-of-the-art closed- and open-source models shows that current systems still struggle with director-level control and fine-grained audio-visual synchronization, while modular or agentic generation pipelines offer a promising path toward narrowing the gap between open- and closed-source models. We will release the benchmark data and evaluation code to facilitate future research.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces MSAVBench as the first comprehensive benchmark and adaptive hybrid evaluation framework for multi-shot audio-video (MSAV) generation. It spans four dimensions (video, audio, shot, reference), supports up to 15 shots and non-realistic scenarios, incorporates adaptive self-correction for shot segmentation, instance-wise rubrics, and tool-grounded extraction, reports a 91.5% Spearman rank correlation with human judgments, and evaluates 19 closed- and open-source models to highlight gaps in director-level control and audio-visual synchronization while noting promise in modular pipelines. The benchmark data and code will be released.
Significance. If the 91.5% correlation is shown to be independently validated without circularity in rubric design or human data collection, MSAVBench would provide a much-needed standardized and reliable tool for evaluating complex multi-shot generation models, addressing limitations in scope and rigidity of prior benchmarks. The systematic evaluation of 19 models offers concrete insights into current model weaknesses, and the explicit commitment to releasing benchmark data and evaluation code is a clear strength that supports reproducibility and community progress in this emerging area.
major comments (2)
- Abstract: The central reliability claim rests on achieving a Spearman rank correlation of 91.5% with human judgments. The manuscript provides no protocol details on human evaluation (annotator count, inter-rater reliability, blinding procedures, or confirmation that the four dimensions, adaptive rules, and instance-wise rubrics were not iteratively refined against the same human ratings used for the correlation). This directly bears on whether the metric demonstrates independent validity or internal consistency, as raised by the stress-test concern.
- Abstract (evaluation framework description): The adaptive self-correction for segmentation, instance-wise rubrics for subjective metrics, and tool-grounded evidence extraction are presented as key robustness improvements. However, without ablation results, concrete examples of failure-mode handling (e.g., for 15-shot non-realistic sequences), or evidence that these components avoid introducing new biases, it remains unclear whether the framework provides a complete and unbiased measure across the claimed diverse task settings.
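One way to rule out the circularity raised in the first major comment is to fix, before any rubric tuning, which rated items may be used for development and which are reserved for the reported correlation. The sketch below shows a deterministic hash-based split; it is an assumption about how such a protocol could be implemented, not something the paper describes, and the names `assign_split`, `split_items`, the salt string, and the `clip_000` style ids are hypothetical.

```python
import hashlib

def assign_split(item_id, held_out_fraction=0.5, salt="msav-human-eval"):
    """Deterministically assign an item to 'dev' or 'held_out' by hashing its id.

    The salt is a hypothetical constant; fixing it in advance makes the
    assignment reproducible and independent of any rating values.
    """
    digest = hashlib.sha256(f"{salt}:{item_id}".encode("utf-8")).digest()
    # Map the first 8 bytes of the digest to a float in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return "held_out" if bucket < held_out_fraction else "dev"

def split_items(item_ids, held_out_fraction=0.5):
    """Partition item ids into a development set and a held-out set."""
    splits = {"dev": [], "held_out": []}
    for item_id in item_ids:
        splits[assign_split(item_id, held_out_fraction)].append(item_id)
    return splits

# Hypothetical clip ids; rubrics may be tuned on 'dev' only, and the
# human-alignment correlation is then reported on 'held_out' only.
item_ids = [f"clip_{i:03d}" for i in range(200)]
splits = split_items(item_ids)
```

Because the assignment depends only on the item id and a fixed salt, it can be published alongside the benchmark so that readers can verify which ratings were never seen during rubric design.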
minor comments (2)
- The abstract mentions 'challenging non-realistic scenarios' and 'director-level control' but does not define these terms or provide illustrative examples; adding a short definition or example in the introduction would improve clarity.
- The planned release of benchmark data and evaluation code is noted positively; including a brief statement on licensing or access method would further strengthen the reproducibility claim.
Simulated Author's Rebuttal
We thank the referee for their constructive feedback and for recognizing the potential value of MSAVBench. We address each major comment below and commit to revisions that improve transparency and evidence for our claims.
- Referee: Abstract: The central reliability claim rests on achieving a Spearman rank correlation of 91.5% with human judgments. The manuscript provides no protocol details on human evaluation (annotator count, inter-rater reliability, blinding procedures, or confirmation that the four dimensions, adaptive rules, and instance-wise rubrics were not iteratively refined against the same human ratings used for the correlation). This directly bears on whether the metric demonstrates independent validity or internal consistency, as raised by the stress-test concern.
  Authors: We agree that the current manuscript lacks sufficient protocol details to fully substantiate independent validity and address potential circularity. In the revised version we will add a dedicated subsection on human evaluation methodology that specifies annotator count, inter-rater reliability, blinding procedures, and an explicit statement that rubric design and adaptive rules were finalized prior to and independently of the human ratings used for the reported Spearman correlation. We will also include a brief summary of these elements in the abstract.
  Revision: yes
- Referee: Abstract (evaluation framework description): The adaptive self-correction for segmentation, instance-wise rubrics for subjective metrics, and tool-grounded evidence extraction are presented as key robustness improvements. However, without ablation results, concrete examples of failure-mode handling (e.g., for 15-shot non-realistic sequences), or evidence that these components avoid introducing new biases, it remains unclear whether the framework provides a complete and unbiased measure across the claimed diverse task settings.
  Authors: We concur that ablation studies and concrete examples would strengthen the robustness claims. The revised manuscript will incorporate ablation experiments isolating the adaptive self-correction and instance-wise rubrics, quantitative results on their effect on human correlation, and qualitative examples of failure-mode handling for 15-shot non-realistic sequences. We will also add analysis showing how tool-grounded extraction reduces subjectivity relative to purely LLM-based scoring.
  Revision: yes
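The inter-rater reliability that the authors commit to reporting could take several forms. As one illustration, the sketch below computes Cohen's kappa for two annotators who assign categorical labels to the same clips. This is a generic statistic chosen here for simplicity, not a method taken from the paper, and the pass/fail judgments are invented.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labeled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty, equal-length label lists")
    n = len(labels_a)
    # Observed agreement: fraction of items given the same label by both.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: probability both annotators pick the same label
    # if each labeled independently at their own observed rates.
    expected = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)
    if expected == 1.0:
        raise ValueError("kappa is undefined when chance agreement is 1")
    return (observed - expected) / (1 - expected)

# Invented pass/fail judgments on eight hypothetical clips (NOT paper data).
annotator_1 = ["pass", "pass", "fail", "pass", "fail", "fail", "pass", "pass"]
annotator_2 = ["pass", "fail", "fail", "pass", "fail", "pass", "pass", "pass"]
kappa = cohens_kappa(annotator_1, annotator_2)
```

Here the two annotators agree on 6 of 8 clips, but because both say "pass" most of the time, a good share of that agreement is expected by chance, and kappa comes out at 7/15 (about 0.47). For ordinal ratings or more than two annotators, a weighted kappa or Krippendorff's alpha would be the more usual choice.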
Circularity Check
No significant circularity in MSAVBench benchmark and human alignment claim
The paper introduces MSAVBench as a new benchmark spanning four dimensions with an adaptive hybrid evaluation framework featuring self-correction segmentation, instance-wise rubrics, and tool-grounded extraction. The 91.5% Spearman correlation is reported as alignment with independent human judgments rather than any internally fitted or self-defined metric. No equations, parameters, or derivations reduce the framework's outputs or validity to its own construction by definition. The central claims rest on external human ratings and systematic model evaluations, which are presented as falsifiable and independent of the benchmark's internal rules. This is a standard benchmark paper with external validation and no load-bearing self-citation chains or fitted predictions that collapse to inputs.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Human judgments serve as the authoritative reference for validating automated evaluation metrics in generative media tasks.
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/Cost/FunctionalEquation.lean, washburn_uniqueness_aczel
  Relation between the paper passage and the cited Recognition theorem: unclear
  Paper passage: "Our evaluation framework improves robustness through an adaptive self-correction mechanism for shot segmentation, instance-wise rubrics for subjective metrics, and tool-grounded evidence extraction for complex judgments."
- IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean, reality_from_one_distinction
  Relation between the paper passage and the cited Recognition theorem: unclear
  Paper passage: "Spearman rank correlation of 91.5% with human judgments"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.