MSAVBench: Towards Comprehensive and Reliable Evaluation of Multi-Shot Audio-Video Generation


arxiv: 2605.20183 · v4 · pith:ZYNQLIP7new · submitted 2026-05-19 · 💻 cs.CV

Yujie Wei, Yujin Han, Zhekai Chen, Yongming Li, Kaixun Jiang, Zhihang Liu, Quanhao Li, Zhiwu Qing, Xiang Wang, Zhen Xing, Ruihang Chu, Lingyi Hong, Yefei He, Junjie Zhou, Junqiu Yu, Yang Shi, Difan Zou, Kai Zhu, Shiwei Zhang, Yingya Zhang, Yu Liu, Xihui Liu, Hongming Shan

Pith reviewed 2026-06-30 17:58 UTC · model grok-4.3

classification 💻 cs.CV

keywords: multi-shot video generation, audio-video evaluation, benchmark, human alignment, video quality assessment, shot segmentation, multimodal generation

The pith

MSAVBench introduces the first benchmark and evaluation framework for multi-shot audio-video generation that reaches 91.5 percent Spearman correlation with human judgments.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces MSAVBench as a benchmark and adaptive hybrid evaluation framework to assess multi-shot audio-video generation models. It covers four dimensions—video, audio, shot, and reference—across tasks with up to 15 shots and non-realistic scenarios. The framework uses adaptive self-correction for shot segmentation, instance-wise rubrics, and tool-grounded evidence extraction to improve robustness. Systematic testing of 19 models reveals persistent difficulties with director-level control and audio-visual synchronization, while modular pipelines appear to narrow performance gaps between open- and closed-source systems.

Core claim

MSAVBench is the first comprehensive benchmark and adaptive hybrid evaluation framework for multi-shot audio-video generation. It spans video, audio, shot, and reference dimensions with diverse settings up to 15 shots and challenging scenarios. The framework incorporates an adaptive self-correction mechanism for shot segmentation, instance-wise rubrics for subjective metrics, and tool-grounded evidence extraction, achieving a Spearman rank correlation of 91.5 percent with human judgments. Evaluation of 19 state-of-the-art models indicates that current systems struggle with director-level control and fine-grained audio-visual synchronization, while modular or agentic pipelines show promise in

What carries the argument

MSAVBench benchmark and adaptive hybrid evaluation framework, which applies adaptive self-correction for shot segmentation, instance-wise rubrics, and tool-grounded evidence extraction to produce scores aligned with human judgments.

If this is right

Current models require advances in director-level control over shot sequencing and narrative structure.
Fine-grained audio-visual synchronization remains an open limitation even in top-performing systems.
Modular and agentic generation pipelines provide a concrete route to close the gap between open-source and closed-source performance.
Evaluation must account for varying shot counts and non-realistic content to remain relevant to real-world demands.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Future model training could incorporate the benchmark's rubrics directly as reward signals during reinforcement learning.
Extending the benchmark to longer narratives beyond 15 shots would test whether current limitations scale with complexity.
The framework's tool-grounded extraction may generalize to other multimodal generation tasks such as text-to-3D or interactive video.

Load-bearing premise

The adaptive self-correction, instance-wise rubrics, and tool-grounded extraction together produce robust, unbiased scores that fully capture the quality of multi-shot audio-video outputs.

What would settle it

A new model or pipeline that receives high automated scores on MSAVBench yet receives consistently lower rankings from human raters on the same outputs, or vice versa.
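
One way to run that check is to recompute the rank agreement on a fresh set of outputs. The sketch below is a minimal, dependency-free Spearman rank correlation that uses average ranks for ties. It is not the paper's evaluation code: the function names and the four example scores are hypothetical placeholders, not MSAVBench results.

```python
from math import sqrt


def average_ranks(values):
    """Return 1-based ranks, giving tied values the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation computed on the ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(vx * vy)


# Hypothetical per-model scores (placeholders, not values from the paper).
auto_scores = [0.81, 0.64, 0.72, 0.55]   # automated benchmark score per model
human_scores = [4.5, 3.1, 3.9, 2.8]      # mean human rating per model
rho = spearman(auto_scores, human_scores)
print(f"Spearman rho = {rho:.3f}")
```

A coefficient near 1 means the automated ranking tracks the human one. A markedly lower value on newly released models would be the kind of evidence described above.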

Figures

Figures reproduced from arXiv: 2605.20183 by Difan Zou, Hongming Shan, Junjie Zhou, Junqiu Yu, Kaixun Jiang, Kai Zhu, Lingyi Hong, Quanhao Li, Ruihang Chu, Shiwei Zhang, Xiang Wang, Xihui Liu, Yang Shi, Yefei He, Yingya Zhang, Yongming Li, Yujie Wei, Yujin Han, Yu Liu, Zhekai Chen, Zhen Xing, Zhihang Liu, Zhiwu Qing.

**Figure 1.** Overview of MSAVBench. Left: the benchmark spans four data dimensions, namely video, audio, shot, and reference, covering diverse prompts, shot counts, and realistic and non-realistic scenarios. Right: the evaluation suite assesses generated MSAV content at four levels, including global, cross-shot, intra-shot, and reference levels, using a hybrid strategy that combines specialized expert models, rubric-ba…

**Figure 2.** Diverse distribution of MSAVBench. The benchmark covers diverse generation categories…

**Figure 3.** Overview of the MSAVBench evaluation framework. We first perform agentic preprocessing with iterative shot self-correction to improve boundary quality. Metrics are then evaluated with stratified scoring paradigms, including expert models for well-defined tasks, rubric-based VLM scoring for subjective dimensions, and tool-grounded agentic scoring for complex properties.

**Figure 4.** Qualitative failure cases of evaluated models. Examples include text rendering errors (A), counterfactual subject mismatches (B), audio-visual synchronization failures (C), layout control failures (D), and incorrect subject counts (E). Finding 2: Compared to basic audio-visual fidelity, open-source models lag significantly behind closed systems in “director-level” structural control and cinematic language. …

**Figure 5.** The data construction pipeline of MSAVBench. (1) Domain experts define an eight-category seed taxonomy with fine-grained sub-categories, with diverse types of subject, scene, and visual style. (2) GPT-5.4 first samples (theme, subject, scene, style) quadruples and synthesises an initial multi-shot script with structured per-shot metadata; a Prompt-Enhancement (PE) model then rewrites it into the global-to-…

**Figure 6.** Long-tail cinematic-language and tonal distributions of MSAVBench. Shot scale, camera angle, transition type and tone×saturation distributions on the released 286-prompt suite. … transitions span 4 major types (hard cut 66.9%, dissolve 18.7%, none 13.0%, match cut / fade 1.4%); and lighting is reported with 5 major types (natural, side, soft, neon, low-key).

**Figure 7.** Screenshot of the annotation interface used for pairwise expert evaluation.

Original abstract

Video generation is rapidly evolving from single-shot synthesis to complex multi-shot audio-video (MSAV) narratives to meet real-world demands. However, evaluating such frontier models remains a fundamental challenge. Existing benchmarks are limited in scope and data diversity, and rely on rigid evaluation pipelines, preventing systematic and reliable assessment of modern MSAV models. To bridge these gaps, we introduce MSAVBench, the first comprehensive benchmark and adaptive hybrid evaluation framework for multi-shot audio-video generation. Our benchmark spans four key dimensions, video, audio, shot, and reference, covering diverse task settings, varying shot counts of up to 15, and challenging non-realistic scenarios. Our evaluation framework improves robustness through an adaptive self-correction mechanism for shot segmentation, instance-wise rubrics for subjective metrics, and tool-grounded evidence extraction for complex judgments. Furthermore, MSAVBench achieves high alignment with human judgments, reaching a Spearman rank correlation of 91.5%. Our systematic evaluation of 19 state-of-the-art closed- and open-source models shows that current systems still struggle with director-level control and fine-grained audio-visual synchronization, while modular or agentic generation pipelines offer a promising path toward narrowing the gap between open- and closed-source models. The benchmark data and evaluation code are publicly available at https://github.com/ali-vilab/MSAVBench.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

MSAVBench adds a needed benchmark for multi-shot audio-video with public code and a reported 91.5% human correlation, though the adaptive framework's details need checking.

The paper's main contribution is MSAVBench, a benchmark built specifically for multi-shot audio-video generation. It covers four dimensions (video, audio, shot, reference), handles up to 15 shots, includes non-realistic scenarios, and comes with an adaptive hybrid evaluation setup that reports a 91.5% Spearman correlation with human judgments. The authors also evaluate 19 models and release the data and code.

What stands out is the scope: earlier benchmarks were narrower and used more rigid pipelines, so this fills a clear gap for current frontier models. The systematic comparison is straightforward and highlights real weaknesses in director-level control and audio-visual synchronization, while noting that modular or agentic pipelines help narrow the gap between open- and closed-source models. Public release of the benchmark makes it immediately usable.

The correlation result is the load-bearing claim. The adaptive self-correction, instance-wise rubrics, and tool-grounded extraction are presented as improvements, but without the full paper it is hard to judge how well they avoid selection bias or inconsistent scoring. The model rankings follow standard practice and do not appear circular.

This is for researchers working on video generation who need better ways to measure progress on narrative, multi-shot outputs. It is not a foundational theoretical advance but supplies concrete tooling and data that the subfield can use.

The work is coherent on its own terms and deserves peer review so the evaluation details can be examined.

Referee Report

3 major / 2 minor

Summary. The paper introduces MSAVBench, the first comprehensive benchmark and adaptive hybrid evaluation framework for multi-shot audio-video (MSAV) generation. It covers four dimensions (video, audio, shot, reference) with diverse tasks, up to 15 shots, and non-realistic scenarios. The framework incorporates adaptive self-correction for shot segmentation, instance-wise rubrics for subjective metrics, and tool-grounded evidence extraction. It reports a 91.5% Spearman rank correlation with human judgments and evaluates 19 closed- and open-source models, concluding that current systems struggle with director-level control and fine-grained audio-visual synchronization while modular/agentic pipelines show promise. Benchmark data and code are released publicly.

Significance. If the human-alignment claim and robustness mechanisms hold, MSAVBench would fill a critical gap by providing a standardized, multi-dimensional evaluation protocol for an emerging class of complex generative models, enabling reproducible comparisons and highlighting actionable research directions such as improved synchronization. The public release of data and code strengthens its potential impact.

major comments (3)

Abstract and §3 (Evaluation Framework): The central claim of 91.5% Spearman correlation with human judgments is load-bearing, yet the manuscript provides no quantitative details on the number of raters, inter-rater reliability (e.g., Krippendorff’s alpha), or the exact protocol for collecting human scores; without these, it is impossible to assess whether the reported alignment is robust or sensitive to annotation variance.
§4 (Benchmark Construction) and §5 (Experiments): The adaptive self-correction mechanism, instance-wise rubrics, and tool-grounded extraction are presented as key robustness improvements, but no ablation study isolates their contribution to the final correlation score; this leaves open whether the high alignment depends on these components or would be achieved by simpler fixed rubrics.
Table 2 (Model Evaluation Results): The claim that modular/agentic pipelines narrow the open- vs. closed-source gap is supported only by aggregate rankings; without per-dimension breakdowns (video/audio/shot/reference) or statistical significance tests on the observed differences, the conclusion that these pipelines are “promising” remains under-supported.
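
To make the inter-rater reliability request in the first major comment concrete, here is a minimal sketch of Krippendorff's alpha. It assumes interval-scaled ratings; that scale is an assumption for illustration only, since the paper's rating protocol is not described here and pairwise judgments would call for a different distance metric. The function name and the example ratings are hypothetical.

```python
def krippendorff_alpha_interval(units):
    """Krippendorff's alpha with the interval (squared-difference) metric.

    `units` is a list of lists; each inner list holds the ratings that
    different raters gave to one item. Items with fewer than two ratings
    cannot be paired and are skipped.
    """
    units = [u for u in units if len(u) >= 2]
    values = [v for u in units for v in u]
    n = len(values)
    if n < 2:
        raise ValueError("need at least two pairable ratings")

    # Observed disagreement: squared differences over ordered pairs within
    # each item, weighted by 1 / (m_u - 1), then averaged over all n values.
    d_o = 0.0
    for u in units:
        pair_sum = sum((a - b) ** 2 for a in u for b in u)
        d_o += pair_sum / (len(u) - 1)
    d_o /= n

    # Expected disagreement: squared differences over all ordered pairs of
    # ratings, regardless of which item they belong to.
    d_e = sum((a - b) ** 2 for a in values for b in values) / (n * (n - 1))
    if d_e == 0:
        raise ValueError("alpha is undefined when all ratings are identical")
    return 1.0 - d_o / d_e


# Hypothetical ratings: three raters scoring four generated videos on a 1-5 scale.
ratings = [
    [4, 4, 5],
    [2, 3, 2],
    [5, 5, 5],
    [1, 2, 1],
]
print(f"alpha = {krippendorff_alpha_interval(ratings):.3f}")
```

Alpha equals 1 under perfect agreement and falls toward 0 (or below) as raters disagree as much as chance would predict. Reporting it next to the rater count would let readers judge how much annotation noise sits underneath the 91.5% figure.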

minor comments (2)

The GitHub repository link is given, but the manuscript should explicitly state the commit hash or version tag used for all reported numbers to ensure exact reproducibility.
[§2] Notation for the four evaluation dimensions is introduced in the abstract but should be formalized with a table or diagram early in §2 to aid readers.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback. Below we provide point-by-point responses to the major comments.

Referee: Abstract and §3 (Evaluation Framework): The central claim of 91.5% Spearman correlation with human judgments is load-bearing, yet the manuscript provides no quantitative details on the number of raters, inter-rater reliability (e.g., Krippendorff’s alpha), or the exact protocol for collecting human scores; without these, it is impossible to assess whether the reported alignment is robust or sensitive to annotation variance.

Authors: We agree with the referee that these methodological details are crucial and were insufficiently described. In the revised manuscript, we will provide quantitative details on the number of raters, inter-rater reliability (using Krippendorff’s alpha), and the exact human scoring protocol in an expanded Section 3.
revision: yes

Referee: §4 (Benchmark Construction) and §5 (Experiments): The adaptive self-correction mechanism, instance-wise rubrics, and tool-grounded extraction are presented as key robustness improvements, but no ablation study isolates their contribution to the final correlation score; this leaves open whether the high alignment depends on these components or would be achieved by simpler fixed rubrics.

Authors: We concur that an ablation study would help clarify the contribution of each robustness mechanism. We will include such an ablation in the revised §5, comparing the full framework against variants with fixed rubrics and without self-correction.
revision: yes

Referee: Table 2 (Model Evaluation Results): The claim that modular/agentic pipelines narrow the open- vs. closed-source gap is supported only by aggregate rankings; without per-dimension breakdowns (video/audio/shot/reference) or statistical significance tests on the observed differences, the conclusion that these pipelines are “promising” remains under-supported.

Authors: We accept that the support for the claim regarding modular/agentic pipelines can be strengthened. In the revision, we will add per-dimension breakdowns to Table 2 and perform statistical significance tests on the differences observed.
revision: yes
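
The significance testing promised in the last response could take several forms. One simple option, sketched below, is a paired sign-flip permutation test on per-prompt score differences between two systems. This is an illustrative choice rather than the authors' stated method, and the function name, parameters, and example scores are hypothetical.

```python
import random


def paired_permutation_test(scores_a, scores_b, n_resamples=5000, seed=0):
    """Two-sided paired permutation test on the mean score difference.

    Under the null hypothesis that the two systems are interchangeable,
    the sign of each per-prompt difference is equally likely to be + or -.
    Returns a Monte Carlo p-value with the usual +1 correction.
    """
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("need two non-empty score lists of equal length")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    observed = abs(sum(diffs) / len(diffs))

    rng = random.Random(seed)  # fixed seed so the result is reproducible
    hits = 0
    for _ in range(n_resamples):
        flipped = sum(d if rng.random() < 0.5 else -d for d in diffs)
        # Small tolerance so exact ties are not lost to floating-point rounding.
        if abs(flipped / len(diffs)) >= observed - 1e-12:
            hits += 1
    return (hits + 1) / (n_resamples + 1)


# Hypothetical per-prompt scores for two systems on the same eight prompts.
pipeline_scores = [0.71, 0.68, 0.74, 0.66, 0.70, 0.73, 0.69, 0.72]
baseline_scores = [0.65, 0.66, 0.70, 0.61, 0.69, 0.68, 0.66, 0.67]
p_value = paired_permutation_test(pipeline_scores, baseline_scores)
print(f"p = {p_value:.4f}")
```

Pairing by prompt matters here: both systems are scored on the same benchmark items, so testing the per-prompt differences removes prompt difficulty as a source of variance. The same test can be run separately on each of the four dimensions to back the per-dimension breakdowns.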

Circularity Check

0 steps flagged

No significant circularity identified

The paper introduces an empirical benchmark (MSAVBench) and reports a direct Spearman rank correlation of 91.5% between its hybrid evaluation framework and human judgments. This is a measured outcome from human studies, not a derived quantity obtained by fitting parameters to the target result or by self-referential definitions. No equations, fitted inputs renamed as predictions, or load-bearing self-citations appear in the provided text that would reduce the alignment claim to a tautology. The described components (adaptive self-correction, instance-wise rubrics, tool-grounded extraction) are presented as methodological improvements whose validity is assessed externally via the human correlation; the evaluation of 19 models follows standard benchmark practice without internal reduction to the inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Benchmark paper that relies on the domain assumption that human judgments constitute reliable ground truth and that the chosen dimensions and adaptive mechanisms adequately represent multi-shot generation quality.

axioms (1)

Domain assumption: Human judgments serve as the ground truth for validating automated evaluation metrics. The 91.5% Spearman correlation is presented as evidence of framework quality.

Reference graph

Works this paper leans on

80 extracted references · 39 canonical work pages · 16 internal anchors

[1]

PaddleOCR 3.0 Technical Report, author=Cheng Cui and Ting Sun and Manhui Lin and Tingquan Gao and Yubo Zhang and Jiaxuan Liu and Xueqing Wang and Zelun Zhang and Changda Zhou and Hongen Liu and Yue Zhang and Wenyu Lv and Kui Huang and Yichao Zhang and Jing Zhang and Jun Zhang and Yi Liu and Dianhai Yu and Yanjun Ma, 2025

2025
[2]

Qwen3-VL Technical Report

Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, Wenbin Ge, Zhifang Guo, Qidong Huang, Jie Huang, Fei Huang, Binyuan Hui, Shutong Jiang, Zhaohai Li, Mingsheng Li, Mei Li, Kaixin Li, Zicheng Lin, Junyang Lin, Xuejing Liu, Jiawei Liu, Chenglong Liu, Yang Liu, Dayiheng Liu, Shixuan ...

work page internal anchor Pith review Pith/arXiv arXiv 2025
[3]

Blazepose: On-device real-time body pose tracking.arXiv preprint arXiv:2006.10204, 2020

Valentin Bazarevsky, Ivan Grishchenko, Karthik Raveendran, Tyler Zhu, Fan Zhang, and Matthias Grundmann. Blazepose: On-device real-time body pose tracking.arXiv preprint arXiv:2006.10204, 2020

work page arXiv 2006
[4]

Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets

Andreas Blattmann, Tim Dockhorn, Sumith Kulal, Daniel Mendelevitch, Maciej Kilian, Do- minik Lorenz, Yam Levi, Zion English, Vikram V oleti, Adam Letts, et al. Stable video diffusion: Scaling latent video diffusion models to large datasets.arXiv preprint arXiv:2311.15127, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[5]

Video generation models as world simulators

Tim Brooks, Bill Peebles, Connor Holmes, Will DePue, Yufei Guo, Leo Jing, David Schnurr, Joe Taylor, Troy Luhman, Eric Luhman, et al. Video generation models as world simulators. OpenAI Blog, 1(8):1, 2024

2024
[6]

Kevin Cai, Chonghua Liu, and David M. Chan. Anim-400k: A large-scale dataset for automated end to end dubbing of video. InICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 1–5. IEEE, 2024

2024
[7]

T2AV-Compass: Towards Unified Evaluation for Text-to-Audio-Video Generation

Zhe Cao, Tao Wang, Jiaming Wang, Yanghai Wang, Yuanxing Zhang, Jialu Chen, Miao Deng, Jiahao Wang, Yubin Guo, Chenxi Liao, et al. T2av-compass: Towards unified evaluation for text-to-audio-video generation.arXiv preprint arXiv:2512.21094, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[8]

Talkvid: A large-scale diversified dataset for audio-driven talking head synthesis, 2025

Shunian Chen, Hejin Huang, Yexin Liu, Zihan Ye, Pengcheng Chen, Chenghao Zhu, Michael Guan, Rongsheng Wang, Junying Chen, Guanbin Li, Ser-Nam Lim, Harry Yang, and Benyou Wang. Talkvid: A large-scale diversified dataset for audio-driven talking head synthesis, 2025

2025
[9]

MMAudio: Taming Multimodal Joint Training for High-Quality Video-to-Audio Synthesis, 2025

Ho Kei Cheng, Masato Ishii, Akio Hayakawa, Takashi Shibuya, Alexander Schwing, and Yuki Mitsufuji. MMAudio: Taming Multimodal Joint Training for High-Quality Video-to-Audio Synthesis, 2025. 10

2025
[10]

W2v-bert: Combining contrastive learning and masked language modeling for self-supervised speech pre-training

Yu-An Chung, Yu Zhang, Wei Han, Chung-Cheng Chiu, James Qin, Ruoming Pang, and Yonghui Wu. W2v-bert: Combining contrastive learning and masked language modeling for self-supervised speech pre-training. In2021 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), pages 244–250. IEEE, 2021

2021
[11]

Paddleocr-vl-1.5: Towards a multi-task 0.9b vlm for robust in-the-wild document parsing, 2026

Cheng Cui, Ting Sun, Suyin Liang, Tingquan Gao, Zelun Zhang, Jiaxuan Liu, Xueqing Wang, Changda Zhou, Hongen Liu, Manhui Lin, Yue Zhang, Yubo Zhang, Yi Liu, Dianhai Yu, and Yanjun Ma. Paddleocr-vl-1.5: Towards a multi-task 0.9b vlm for robust in-the-wild document parsing, 2026

2026
[12]

PaddleOCR-VL: Boosting Multilingual Document Parsing via a 0.9B Ultra-Compact Vision-Language Model, 2025

Cheng Cui, Ting Sun, Suyin Liang, Tingquan Gao, Zelun Zhang, Jiaxuan Liu, Xueqing Wang, Changda Zhou, Hongen Liu, Manhui Lin, Yue Zhang, Yubo Zhang, Handong Zheng, Jing Zhang, Jun Zhang, Yi Liu, Dianhai Yu, and Yanjun Ma. PaddleOCR-VL: Boosting Multilingual Document Parsing via a 0.9B Ultra-Compact Vision-Language Model, 2025

2025
[13]

Demucs: Deep Extractor for Music Sources with extra unlabeled data remixed

Alexandre Défossez, Nicolas Usunier, Léon Bottou, and Francis Bach. Demucs: Deep extractor for music sources with extra unlabeled data remixed.arXiv preprint arXiv:1909.01174, 2019

work page Pith review arXiv 1909
[14]

Arcface: Additive angular margin loss for deep face recognition

Jiankang Deng, Jia Guo, Niannan Xue, and Stefanos Zafeiriou. Arcface: Additive angular margin loss for deep face recognition. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 4690–4699, 2019

2019
[15]

Scaling rectified flow trans- formers for high-resolution image synthesis

Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, et al. Scaling rectified flow trans- formers for high-resolution image synthesis. InForty-first international conference on machine learning, 2024

2024
[16]

Veo 3.1.https://deepmind.google/technologies/veo/, 2026

Google DeepMind. Veo 3.1.https://deepmind.google/technologies/veo/, 2026

2026
[17]

Audcast: Audio-driven human video generation by cascaded diffusion transformers

Jiazhi Guan, Kaisiyuan Wang, Zhiliang Xu, Quanwei Yang, Yasheng Sun, Shengyi He, Borong Liang, Yukang Cao, Yingying Li, Haocheng Feng, et al. Audcast: Audio-driven human video generation by cascaded diffusion transformers. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10678–10689, 2025

2025
[18]

Dreamid-omni: Unified framework for controllable human-centric audio-video generation.arXiv preprint arXiv:2602.12160, 2026

Xu Guo, Fulong Ye, Qichao Sun, Liyang Chen, Bingchuan Li, Pengze Zhang, Jiawei Liu, Song- tao Zhao, Qian He, and Xiangwang Hou. Dreamid-omni: Unified framework for controllable human-centric audio-video generation.arXiv preprint arXiv:2602.12160, 2026

work page arXiv 2026
[19]

LTX-Video: Realtime Video Latent Diffusion

Yoav HaCohen, Nisan Chiprut, Benny Brazowski, Daniel Shalem, Dudu Moshe, Eitan Richard- son, Eran Levin, Guy Shiran, Nir Zabari, Ori Gordon, Poriya Panet, Sapir Weissbuch, Victor Kulikov, Yaki Bitterman, Zeev Melumian, and Ofir Bibi. Ltx-video: Realtime video latent diffusion.arXiv preprint arXiv:2501.00103, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[20]

Video-bench: Human-aligned video generation benchmark

Hui Han, Siyuan Li, Jiaqi Chen, Yiwen Yuan, Yuling Wu, Yufan Deng, Chak Tou Leong, Hanwen Du, Junchen Fu, Youhua Li, et al. Video-bench: Human-aligned video generation benchmark. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 18858–18868, 2025

2025
[21]

AesRM: Improving Video Aesthetics with Expert-Level Feedback

Yujin Han, Yujie Wei, Yefei He, Xinyu Liu, Tianle Li, Zichao Yu, Andi Han, Shiwei Zhang, Tingyu Weng, and Difan Zou. Aesrm: Improving video aesthetics with expert-level feedback. arXiv preprint arXiv:2604.28078, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[22]

HappyHorse 1.0.https://www.happyhorse.cn/, 2026

HappyHorse Team. HappyHorse 1.0.https://www.happyhorse.cn/, 2026

2026
[23]

Denoising diffusion probabilistic models.Advances in neural information processing systems, 33:6840–6851, 2020

Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models.Advances in neural information processing systems, 33:6840–6851, 2020

2020
[24]

Video diffusion models.Advances in neural information processing systems, 35:8633–8646, 2022

Jonathan Ho, Tim Salimans, Alexey Gritsenko, William Chan, Mohammad Norouzi, and David J Fleet. Video diffusion models.Advances in neural information processing systems, 35:8633–8646, 2022

2022
[25]

VABench: A Comprehensive Benchmark for Audio-Video Generation

Daili Hua, Xizhi Wang, Bohan Zeng, Xinyi Huang, Hao Liang, Junbo Niu, Xinlong Chen, Quanqing Xu, and Wentao Zhang. Vabench: A comprehensive benchmark for audio-video generation.arXiv preprint arXiv:2512.09299, 2025. 11

work page internal anchor Pith review Pith/arXiv arXiv 2025
[26]

Self forcing: Bridging the train-test gap in autoregressive video diffusion, 2025

Xun Huang, Zhengqi Li, Guande He, Mingyuan Zhou, and Eli Shechtman. Self forcing: Bridging the train-test gap in autoregressive video diffusion, 2025

2025
[27]

Vbench: Comprehensive benchmark suite for video generative models

Ziqi Huang, Yinan He, Jiashuo Yu, Fan Zhang, Chenyang Si, Yuming Jiang, Yuanhan Zhang, Tianxing Wu, Qingyang Jin, Nattapol Chanpaisit, et al. Vbench: Comprehensive benchmark suite for video generative models. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 21807–21818, 2024

2024
[28]

Synchformer: Efficient synchronization from sparse cues

Vladimir Iashin, Weidi Xie, Esa Rahtu, and Andrew Zisserman. Synchformer: Efficient synchronization from sparse cues. InICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 5325–5329. IEEE, 2024

2024
[29]

All-in-one metrical and functional structure analysis with neigh- borhood attentions on demixed audio

Taejun Kim and Juhan Nam. All-in-one metrical and functional structure analysis with neigh- borhood attentions on demixed audio. InIEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), 2023

2023
[30]

HunyuanVideo: A Systematic Framework For Large Video Generative Models

Weijie Kong, Qi Tian, Zijian Zhang, Rox Min, Zuozhuo Dai, Jin Zhou, Jiangfeng Xiong, Xin Li, Bo Wu, Jianwei Zhang, et al. Hunyuanvideo: A systematic framework for large video generative models.arXiv preprint arXiv:2412.03603, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[31]

Kling 3.0.https://klingai.com/global/, 2026

Kuaishou Technology. Kling 3.0.https://klingai.com/global/, 2026

2026
[32]

arXiv preprint arXiv:2412.09262 (2024)

Chunyu Li, Chao Zhang, Weikai Xu, Jingyu Lin, Jinghui Xie, Weiguo Feng, Bingyue Peng, Cunjian Chen, and Weiwei Xing. Latentsync: Taming audio-conditioned latent diffusion models for lip sync with syncnet supervision.arXiv preprint arXiv:2412.09262, 2024

work page arXiv 2024
[33]

Lr-asd: Lightweight and robust network for active speaker detection.International Journal of Computer Vision, 133(7):4749–4769, 2025

Junhua Liao, Haihan Duan, Kanghui Feng, Wanbing Zhao, Yanbing Yang, Liangyin Chen, and Yanru Chen. Lr-asd: Lightweight and robust network for active speaker detection.International Journal of Computer Vision, 133(7):4749–4769, 2025

2025
[34]

Aibench: Evaluating visual-logical consistency in academic illustration generation.arXiv preprint arXiv:2603.28068, 2026

Zhaohe Liao, Kaixun Jiang, Zhihang Liu, Yujie Wei, Junqiu Yu, Quanhao Li, Hong-Tao Yu, Pandeng Li, Yuzheng Wang, Zhen Xing, et al. Aibench: Evaluating visual-logical consistency in academic illustration generation.arXiv preprint arXiv:2603.28068, 2026

work page arXiv 2026
[35]

Javisgpt: A unified multi-modal llm for sounding-video comprehension and generation

Kai Liu, Jungang Li, Yuchong Sun, Shengqiong Wu, Jianzhang Gao, Daoan Zhang, Wei Zhang, Sheng Jin, Sicheng Yu, Geng Zhan, Jiayi Ji, Fan Zhou, Liang Zheng, Shuicheng Y AN, Hao Fei, and Tat-Seng Chua. Javisgpt: A unified multi-modal llm for sounding-video comprehension and generation. InThe Thirty-ninth Annual Conference on Neural Information Processing Sys...

2025
[36]

Javisdit++: Unified modeling and optimization for joint audio-video generation

Kai Liu, Yanhao Zheng, Kai Wang, Shengqiong Wu, Rongjunchen Zhang, Jiebo Luo, Dimitrios Hatzinakos, Ziwei Liu, Hao Fei, and Tat-Seng Chua. Javisdit++: Unified modeling and optimization for joint audio-video generation. InThe Fourteenth International Conference on Learning Representations, 2026

2026
[37]

Grounding dino: Marrying dino with grounded pre-training for open-set object detection

Shilong Liu, Zhaoyang Zeng, Tianhe Ren, Feng Li, Hao Zhang, Jie Yang, Qing Jiang, Chunyuan Li, Jianwei Yang, Hang Su, et al. Grounding dino: Marrying dino with grounded pre-training for open-set object detection. InEuropean conference on computer vision, pages 38–55. Springer, 2024

2024
[38]

Evalcrafter: Benchmarking and evaluating large video generation models.arXiv preprint arXiv:2310.11440, 2023

Yaofang Liu, Xiaodong Cun, Xuebo Liu, Xintao Wang, Yong Zhang, Haoxin Chen, Yang Liu, Tieyong Zeng, Raymond Chan, and Ying Shan. Evalcrafter: Benchmarking and evaluating large video generation models.arXiv preprint arXiv:2310.11440, 2023

work page arXiv 2023
[39]

Shotstream: Streaming multi-shot video generation for inter- active storytelling.arXiv preprint arXiv:2603.25746, 2026

Yawen Luo, Xiaoyu Shi, Junhao Zhuang, Yutian Chen, Quande Liu, Xintao Wang, Pengfei Wan, and Tianfan Xue. Shotstream: Streaming multi-shot video generation for interactive storytelling. arXiv preprint arXiv:2603.25746, 2026

work page arXiv 2026
[40]

Wan-Image: Pushing the Boundaries of Generative Visual Intelligence

Chaojie Mao, Chen-Wei Xie, Chongyang Zhong, Haoyou Deng, Jiaxing Zhao, Jie Xiao, Jinbo Xing, Jingfeng Zhang, Jingren Zhou, Jingyi Zhang, et al. Wan-image: Pushing the boundaries of generative visual intelligence.arXiv preprint arXiv:2604.19858, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[41]

OenAI. GPT-5.4. https://openai.com/zh-Hans-CN/index/introducing-gpt-5-4/ , 2026. 12

2026
[42]

Sora 2.https://openai.com/index/sora-2/, 2025

OpenAI. Sora 2.https://openai.com/index/sora-2/, 2025

2025
[43]

DINOv2: Learning Robust Visual Features without Supervision

Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy V o, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. Dinov2: Learning robust visual features without supervision.arXiv preprint arXiv:2304.07193, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[44] Taejin Park, Ivan Medennikov, Kunal Dhawan, Weiqing Wang, He Huang, Nithin Rao Koluguri, Krishna C Puvvada, Jagadeesh Balam, and Boris Ginsburg. Sortformer: A novel approach for permutation-resolved speaker supervision in speech-to-text systems. arXiv preprint arXiv:2409.06656, 2024.
[45] Adam Polyak, Amit Zohar, Andrew Brown, Andros Tjandra, Animesh Sinha, Ann Lee, Apoorv Vyas, Bowen Shi, Chih-Yao Ma, Ching-Yao Chuang, et al. Movie gen: A cast of media foundation models. arXiv preprint arXiv:2410.13720, 2024.
[46] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PMLR, 2021.
[47] Alec Radford, Jong Wook Kim, Tao Xu, Greg Brockman, Christine McLeavey, and Ilya Sutskever. Robust speech recognition via large-scale weak supervision. In International conference on machine learning, pages 28492–28518. PMLR, 2023.
[48] Team Seedance, De Chen, Liyang Chen, Xin Chen, Ying Chen, Zhuo Chen, Zhuowei Chen, Feng Cheng, Tianheng Cheng, Yufeng Cheng, et al. Seedance 2.0: Advancing video generation for world complexity. arXiv preprint arXiv:2604.14148, 2026.
[49] Sizhe Shan, Qiulin Li, Yutao Cui, Miles Yang, Yuehai Wang, Qun Yang, Jin Zhou, and Zhao Zhong. Hunyuanvideo-foley: Multimodal diffusion with representation alignment for high-fidelity foley audio generation, 2025.
[50] Haoyuan Shi, Yunxin Li, Nanhao Deng, Zhenran Xu, Xinyu Chen, Longyue Wang, Baotian Hu, and Min Zhang. Msvbench: Towards human-level evaluation of multi-shot video generation. arXiv preprint arXiv:2602.23969, 2026.
[51] SII-GAIR, Sand.ai, Ethan Chern, Hansi Teng, Hanwen Sun, Hao Wang, Hong Pan, Hongyu Jia, Jiadi Su, Jin Li, Junjie Yu, Lijie Liu, Lingzhi Li, Lyumanshan Ye, Min Hu, Qiangang Wang, Quanwei Qi, Steffi Chern, Tao Bu, Taoran Wang, Teren Xu, Tianning Zhang, Tiantian Mi, Weixian Xu, Wenqiang Zhang, Wentai Zhang, Xianping Yi, Xiaojie Cai, Xiaoyang Kang, Yan Ma, Y...
[52] Uriel Singer, Adam Polyak, Thomas Hayes, Xi Yin, Jie An, Songyang Zhang, Qiyuan Hu, Harry Yang, Oron Ashual, Oran Gafni, et al. Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:2209.14792, 2022.
[53] Gowthami Somepalli, Anubhav Gupta, Kamal Gupta, Shramay Palta, Micah Goldblum, Jonas Geiping, Abhinav Shrivastava, and Tom Goldstein. Measuring style similarity in diffusion models. arXiv preprint arXiv:2404.01292, 2024.
[54] Tomáš Souček and Jakub Lokoč. Transnet v2: An effective deep network architecture for fast shot transition detection. arXiv preprint arXiv:2008.04838, 2020.
[55] Charles Spearman. The proof and measurement of association between two things. 1961.
[56] OpenMOSS Team, Donghua Yu, Mingshu Chen, Qi Chen, Qi Luo, Qianyi Wu, Qinyuan Cheng, Ruixiao Li, Tianyi Liang, Wenbo Zhang, et al. Mova: Towards scalable and synchronized video-audio generation. arXiv preprint arXiv:2602.08794, 2026.
[57] Qwen Team. Qwen3.5: Accelerating productivity with native multimodal agents, February 2026.
[58] Silero Team. Silero VAD: pre-trained enterprise-grade Voice Activity Detector (VAD), Number Detector and Language Classifier. https://github.com/snakers4/silero-vad, 2024.
[59] The Gemini Team. Gemini 3.1 Pro. https://blog.google/innovation-and-ai/models-and-research/gemini-models/gemini-3-1-pro/, 2026.
[60] Andros Tjandra, Yi-Chiao Wu, Baishan Guo, John Hoffman, Brian Ellis, Apoorv Vyas, Bowen Shi, Sanyuan Chen, Matt Le, Nick Zacharov, Carleigh Wood, Ann Lee, and Wei-Ning Hsu. Meta audiobox aesthetics: Unified automatic quality assessment for speech, music, and sound. 2025.
[61] Tongyi Wanxiang Team. Wan2.7. https://www.wan27.xyz/, 2026.
[62] Team Wan, Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, Chen-Wei Xie, Di Chen, Feiwu Yu, Haiming Zhao, Jianxiao Yang, et al. Wan: Open and advanced large-scale video generative models. arXiv preprint arXiv:2503.20314, 2025.
[63] Kai Wang, Shijian Deng, Jing Shi, Dimitrios Hatzinakos, and Yapeng Tian. Av-dit: Efficient audio-visual diffusion transformer for joint audio and video generation. arXiv preprint arXiv:2406.07686, 2024.
[64] Wei Wang. Japanese Anime Scenes. https://www.kaggle.com/datasets/weiwangk/japanese-anime-scenes, 2023.
[65] Jianhui Wei, Xiaotian Zhang, Yichen Li, Yuan Wang, Yan Zhang, Ziyi Chen, Zhihang Tang, Wei Xu, and Zuozhu Liu. Univbench: Towards unified evaluation for video foundation models. arXiv preprint arXiv:2602.21835, 2026.
[66] Yujie Wei, Xinyu Liu, Shiwei Zhang, Hangjie Yuan, Jinbo Xing, Zhekai Chen, Xiang Wang, Haonan Qiu, Rui Zhao, Yutong Feng, et al. Dreamvideo-omni: Omni-motion controlled multi-subject video customization with latent identity reinforcement learning. arXiv preprint arXiv:2603.12257, 2026.
[67] Yujie Wei, Shiwei Zhang, Zhiwu Qing, Hangjie Yuan, Zhiheng Liu, Yu Liu, Yingya Zhang, Jingren Zhou, and Hongming Shan. Dreamvideo: Composing your dream videos with customized subject and motion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6537–6549, 2024.
[68] Yujie Wei, Shiwei Zhang, Hangjie Yuan, Biao Gong, Longxiang Tang, Xiang Wang, Haonan Qiu, Hengjia Li, Shuai Tan, Yingya Zhang, et al. Dreamrelation: Relation-centric video customization. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 12381–12393, 2025.
[69] Yujie Wei, Shiwei Zhang, Hangjie Yuan, Yujin Han, Zhekai Chen, Jiayu Wang, Difan Zou, Xihui Liu, Yingya Zhang, Yu Liu, et al. Routing matters in moe: Scaling diffusion transformers with explicit routing guidance. arXiv preprint arXiv:2510.24711, 2025.
[70] Yujie Wei, Shiwei Zhang, Hangjie Yuan, Xiang Wang, Haonan Qiu, Rui Zhao, Yutong Feng, Feng Liu, Zhizhong Huang, Jiaxin Ye, et al. Dreamvideo-2: Zero-shot subject-driven video customization with precise motion control. arXiv preprint arXiv:2410.13830, 2024.
[71] Tianxin Xie, Wentao Lei, Kai Jiang, Guanjie Huang, Pengfei Zhang, Chunhui Zhang, Fengji Ma, Haoyu He, Han Zhang, Jiangshan He, et al. Phyavbench: A challenging audio physics-sensitivity benchmark for physically grounded text-to-audio-video generation. arXiv preprint arXiv:2512.23994, 2025.
[72] Kaituo Xu, Yan Jia, Kai Huang, Junjie Chen, Wenpeng Li, Kun Liu, Feng-Long Xie, Xu Tang, and Yao Hu. Fireredasr2s: A state-of-the-art industrial-grade all-in-one automatic speech recognition system. arXiv preprint arXiv:2603.10420, 2026.
[73] Shuai Yang, Wei Huang, Ruihang Chu, Yicheng Xiao, Yuyang Zhao, Xianbang Wang, Muyang Li, Enze Xie, Yingcong Chen, Yao Lu, Song Han, and Yukang Chen. Longlive: Real-time interactive long video generation. 2025.
[74] Shenghai Yuan, Xianyi He, Yufan Deng, Yang Ye, Jinfa Huang, Bin Lin, Jiebo Luo, and Li Yuan. Opens2v-nexus: A detailed benchmark and million-scale dataset for subject-to-video generation. arXiv preprint arXiv:2505.20292, 2025.
[75] Shenghai Yuan, Yuanyang Yin, Zongjian Li, Xinwei Huang, Xiao Yang, and Li Yuan. Helios: Real real-time long video generation model. arXiv preprint arXiv:2603.04379, 2026.
[76] Lei Zhao, Linfeng Feng, Dongxu Ge, Rujin Chen, Fangqiu Yi, Chi Zhang, Xiao-Lei Zhang, and Xuelong Li. Uniform: A unified multi-task diffusion transformer for audio-video generation. arXiv preprint arXiv:2502.03897, 2025.
[77] Yang-Hao Zhou, Haitian Li, Rexar Lin, Heyan Huang, Jinxing Zhou, Changsen Yuan, Tian Lan, Ziqin Zhou, Yudong Li, Jiajun Xu, et al. Mtavg-bench: A comprehensive benchmark for evaluating multi-talker dialogue-centric audio-video generation. arXiv preprint arXiv:2602.00607, 2026.
[78] Ziwei Zhou, Zeyuan Lai, Rui Wang, Yifan Yang, Zhen Xing, Yuqing Yang, Qi Dai, Lili Qiu, and Chong Luo. Avgen-bench: A task-driven benchmark for multi-granular evaluation of text-to-audio-video generation. arXiv preprint arXiv:2604.08540, 2026.
[79] Hongzhou Zhu, Min Zhao, Guande He, Hang Su, Chongxuan Li, and Jun Zhu. Causal forcing: Autoregressive diffusion distillation done right for high-quality real-time interactive video generation, 2026.
[80] Cailin Zhuang, Ailin Huang, Yaoqi Hu, Jingwei Wu, Wei Cheng, Jiaqi Liao, Hongyuan Wang, Xinyao Liao, Weiwei Cai, Hengyuan Xu, et al. Vistorybench: Comprehensive benchmark suite for story visualization. arXiv preprint arXiv:2505.24862, 2025.

[1] Cheng Cui, Ting Sun, Manhui Lin, Tingquan Gao, Yubo Zhang, Jiaxuan Liu, Xueqing Wang, Zelun Zhang, Changda Zhou, Hongen Liu, Yue Zhang, Wenyu Lv, Kui Huang, Yichao Zhang, Jing Zhang, Jun Zhang, Yi Liu, Dianhai Yu, and Yanjun Ma. PaddleOCR 3.0 Technical Report, 2025.
[2] Qwen3-VL Technical Report

Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, Wenbin Ge, Zhifang Guo, Qidong Huang, Jie Huang, Fei Huang, Binyuan Hui, Shutong Jiang, Zhaohai Li, Mingsheng Li, Mei Li, Kaixin Li, Zicheng Lin, Junyang Lin, Xuejing Liu, Jiawei Liu, Chenglong Liu, Yang Liu, Dayiheng Liu, Shixuan ...
[3] Valentin Bazarevsky, Ivan Grishchenko, Karthik Raveendran, Tyler Zhu, Fan Zhang, and Matthias Grundmann. Blazepose: On-device real-time body pose tracking. arXiv preprint arXiv:2006.10204, 2020.
[4] Andreas Blattmann, Tim Dockhorn, Sumith Kulal, Daniel Mendelevitch, Maciej Kilian, Dominik Lorenz, Yam Levi, Zion English, Vikram Voleti, Adam Letts, et al. Stable video diffusion: Scaling latent video diffusion models to large datasets. arXiv preprint arXiv:2311.15127, 2023.
[5] Tim Brooks, Bill Peebles, Connor Holmes, Will DePue, Yufei Guo, Leo Jing, David Schnurr, Joe Taylor, Troy Luhman, Eric Luhman, et al. Video generation models as world simulators. OpenAI Blog, 1(8):1, 2024.
[6] Kevin Cai, Chonghua Liu, and David M. Chan. Anim-400k: A large-scale dataset for automated end to end dubbing of video. In ICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 1–5. IEEE, 2024.
[7] Zhe Cao, Tao Wang, Jiaming Wang, Yanghai Wang, Yuanxing Zhang, Jialu Chen, Miao Deng, Jiahao Wang, Yubin Guo, Chenxi Liao, et al. T2av-compass: Towards unified evaluation for text-to-audio-video generation. arXiv preprint arXiv:2512.21094, 2025.
[8] Shunian Chen, Hejin Huang, Yexin Liu, Zihan Ye, Pengcheng Chen, Chenghao Zhu, Michael Guan, Rongsheng Wang, Junying Chen, Guanbin Li, Ser-Nam Lim, Harry Yang, and Benyou Wang. Talkvid: A large-scale diversified dataset for audio-driven talking head synthesis, 2025.
[9] Ho Kei Cheng, Masato Ishii, Akio Hayakawa, Takashi Shibuya, Alexander Schwing, and Yuki Mitsufuji. MMAudio: Taming Multimodal Joint Training for High-Quality Video-to-Audio Synthesis, 2025.
[10] Yu-An Chung, Yu Zhang, Wei Han, Chung-Cheng Chiu, James Qin, Ruoming Pang, and Yonghui Wu. W2v-bert: Combining contrastive learning and masked language modeling for self-supervised speech pre-training. In 2021 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), pages 244–250. IEEE, 2021.
[11] Cheng Cui, Ting Sun, Suyin Liang, Tingquan Gao, Zelun Zhang, Jiaxuan Liu, Xueqing Wang, Changda Zhou, Hongen Liu, Manhui Lin, Yue Zhang, Yubo Zhang, Yi Liu, Dianhai Yu, and Yanjun Ma. Paddleocr-vl-1.5: Towards a multi-task 0.9b vlm for robust in-the-wild document parsing, 2026.
[12] Cheng Cui, Ting Sun, Suyin Liang, Tingquan Gao, Zelun Zhang, Jiaxuan Liu, Xueqing Wang, Changda Zhou, Hongen Liu, Manhui Lin, Yue Zhang, Yubo Zhang, Handong Zheng, Jing Zhang, Jun Zhang, Yi Liu, Dianhai Yu, and Yanjun Ma. PaddleOCR-VL: Boosting Multilingual Document Parsing via a 0.9B Ultra-Compact Vision-Language Model, 2025.
[13] Alexandre Défossez, Nicolas Usunier, Léon Bottou, and Francis Bach. Demucs: Deep extractor for music sources with extra unlabeled data remixed. arXiv preprint arXiv:1909.01174, 2019.
[14] Jiankang Deng, Jia Guo, Niannan Xue, and Stefanos Zafeiriou. Arcface: Additive angular margin loss for deep face recognition. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 4690–4699, 2019.
[15] Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, et al. Scaling rectified flow transformers for high-resolution image synthesis. In Forty-first international conference on machine learning, 2024.
[16] Google DeepMind. Veo 3.1. https://deepmind.google/technologies/veo/, 2026.
[17] Jiazhi Guan, Kaisiyuan Wang, Zhiliang Xu, Quanwei Yang, Yasheng Sun, Shengyi He, Borong Liang, Yukang Cao, Yingying Li, Haocheng Feng, et al. Audcast: Audio-driven human video generation by cascaded diffusion transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10678–10689, 2025.
[18] Xu Guo, Fulong Ye, Qichao Sun, Liyang Chen, Bingchuan Li, Pengze Zhang, Jiawei Liu, Songtao Zhao, Qian He, and Xiangwang Hou. Dreamid-omni: Unified framework for controllable human-centric audio-video generation. arXiv preprint arXiv:2602.12160, 2026.
[19] Yoav HaCohen, Nisan Chiprut, Benny Brazowski, Daniel Shalem, Dudu Moshe, Eitan Richardson, Eran Levin, Guy Shiran, Nir Zabari, Ori Gordon, Poriya Panet, Sapir Weissbuch, Victor Kulikov, Yaki Bitterman, Zeev Melumian, and Ofir Bibi. Ltx-video: Realtime video latent diffusion. arXiv preprint arXiv:2501.00103, 2024.
[20] Hui Han, Siyuan Li, Jiaqi Chen, Yiwen Yuan, Yuling Wu, Yufan Deng, Chak Tou Leong, Hanwen Du, Junchen Fu, Youhua Li, et al. Video-bench: Human-aligned video generation benchmark. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 18858–18868, 2025.
[21] Yujin Han, Yujie Wei, Yefei He, Xinyu Liu, Tianle Li, Zichao Yu, Andi Han, Shiwei Zhang, Tingyu Weng, and Difan Zou. Aesrm: Improving video aesthetics with expert-level feedback. arXiv preprint arXiv:2604.28078, 2026.
[22] HappyHorse Team. HappyHorse 1.0. https://www.happyhorse.cn/, 2026.
[23] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Advances in neural information processing systems, 33:6840–6851, 2020.
[24] Jonathan Ho, Tim Salimans, Alexey Gritsenko, William Chan, Mohammad Norouzi, and David J Fleet. Video diffusion models. Advances in neural information processing systems, 35:8633–8646, 2022.
[25] Daili Hua, Xizhi Wang, Bohan Zeng, Xinyi Huang, Hao Liang, Junbo Niu, Xinlong Chen, Quanqing Xu, and Wentao Zhang. Vabench: A comprehensive benchmark for audio-video generation. arXiv preprint arXiv:2512.09299, 2025.
[26] Xun Huang, Zhengqi Li, Guande He, Mingyuan Zhou, and Eli Shechtman. Self forcing: Bridging the train-test gap in autoregressive video diffusion, 2025.
[27] Ziqi Huang, Yinan He, Jiashuo Yu, Fan Zhang, Chenyang Si, Yuming Jiang, Yuanhan Zhang, Tianxing Wu, Qingyang Jin, Nattapol Chanpaisit, et al. Vbench: Comprehensive benchmark suite for video generative models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 21807–21818, 2024.
[28] Vladimir Iashin, Weidi Xie, Esa Rahtu, and Andrew Zisserman. Synchformer: Efficient synchronization from sparse cues. In ICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 5325–5329. IEEE, 2024.
[29] Taejun Kim and Juhan Nam. All-in-one metrical and functional structure analysis with neighborhood attentions on demixed audio. In IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), 2023.
[30] Weijie Kong, Qi Tian, Zijian Zhang, Rox Min, Zuozhuo Dai, Jin Zhou, Jiangfeng Xiong, Xin Li, Bo Wu, Jianwei Zhang, et al. Hunyuanvideo: A systematic framework for large video generative models. arXiv preprint arXiv:2412.03603, 2024.
[31] Kuaishou Technology. Kling 3.0. https://klingai.com/global/, 2026.
[32] Chunyu Li, Chao Zhang, Weikai Xu, Jingyu Lin, Jinghui Xie, Weiguo Feng, Bingyue Peng, Cunjian Chen, and Weiwei Xing. Latentsync: Taming audio-conditioned latent diffusion models for lip sync with syncnet supervision. arXiv preprint arXiv:2412.09262, 2024.
[33] Junhua Liao, Haihan Duan, Kanghui Feng, Wanbing Zhao, Yanbing Yang, Liangyin Chen, and Yanru Chen. Lr-asd: Lightweight and robust network for active speaker detection. International Journal of Computer Vision, 133(7):4749–4769, 2025.
[34] Zhaohe Liao, Kaixun Jiang, Zhihang Liu, Yujie Wei, Junqiu Yu, Quanhao Li, Hong-Tao Yu, Pandeng Li, Yuzheng Wang, Zhen Xing, et al. Aibench: Evaluating visual-logical consistency in academic illustration generation. arXiv preprint arXiv:2603.28068, 2026.
[35] Kai Liu, Jungang Li, Yuchong Sun, Shengqiong Wu, Jianzhang Gao, Daoan Zhang, Wei Zhang, Sheng Jin, Sicheng Yu, Geng Zhan, Jiayi Ji, Fan Zhou, Liang Zheng, Shuicheng YAN, Hao Fei, and Tat-Seng Chua. Javisgpt: A unified multi-modal llm for sounding-video comprehension and generation. In The Thirty-ninth Annual Conference on Neural Information Processing Sys...
[36] Kai Liu, Yanhao Zheng, Kai Wang, Shengqiong Wu, Rongjunchen Zhang, Jiebo Luo, Dimitrios Hatzinakos, Ziwei Liu, Hao Fei, and Tat-Seng Chua. Javisdit++: Unified modeling and optimization for joint audio-video generation. In The Fourteenth International Conference on Learning Representations, 2026.
[37] Shilong Liu, Zhaoyang Zeng, Tianhe Ren, Feng Li, Hao Zhang, Jie Yang, Qing Jiang, Chunyuan Li, Jianwei Yang, Hang Su, et al. Grounding dino: Marrying dino with grounded pre-training for open-set object detection. In European conference on computer vision, pages 38–55. Springer, 2024.
[38] [38]

Evalcrafter: Benchmarking and evaluating large video generation models.arXiv preprint arXiv:2310.11440, 2023

Yaofang Liu, Xiaodong Cun, Xuebo Liu, Xintao Wang, Yong Zhang, Haoxin Chen, Yang Liu, Tieyong Zeng, Raymond Chan, and Ying Shan. Evalcrafter: Benchmarking and evaluating large video generation models.arXiv preprint arXiv:2310.11440, 2023

work page arXiv 2023

[39] [39]

Shotstream: Streaming multi-shot video generation for inter- active storytelling.arXiv preprint arXiv:2603.25746, 2026

Yawen Luo, Xiaoyu Shi, Junhao Zhuang, Yutian Chen, Quande Liu, Xintao Wang, Pengfei Wan, and Tianfan Xue. Shotstream: Streaming multi-shot video generation for interactive storytelling. arXiv preprint arXiv:2603.25746, 2026

work page arXiv 2026

[40] [40]

Wan-Image: Pushing the Boundaries of Generative Visual Intelligence

Chaojie Mao, Chen-Wei Xie, Chongyang Zhong, Haoyou Deng, Jiaxing Zhao, Jie Xiao, Jinbo Xing, Jingfeng Zhang, Jingren Zhou, Jingyi Zhang, et al. Wan-image: Pushing the boundaries of generative visual intelligence.arXiv preprint arXiv:2604.19858, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026

[41] [41]

OenAI. GPT-5.4. https://openai.com/zh-Hans-CN/index/introducing-gpt-5-4/ , 2026. 12

2026

[42] [42]

Sora 2.https://openai.com/index/sora-2/, 2025

OpenAI. Sora 2.https://openai.com/index/sora-2/, 2025

2025

[43] [43]

DINOv2: Learning Robust Visual Features without Supervision

Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy V o, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. Dinov2: Learning robust visual features without supervision.arXiv preprint arXiv:2304.07193, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[44] [44]

and Balam, Jagadeesh and Ginsburg, Boris , month = dec, year =

Taejin Park, Ivan Medennikov, Kunal Dhawan, Weiqing Wang, He Huang, Nithin Rao Koluguri, Krishna C Puvvada, Jagadeesh Balam, and Boris Ginsburg. Sortformer: A novel approach for permutation-resolved speaker supervision in speech-to-text systems.arXiv preprint arXiv:2409.06656, 2024

work page arXiv 2024

[45] [45]

Movie Gen: A Cast of Media Foundation Models

Adam Polyak, Amit Zohar, Andrew Brown, Andros Tjandra, Animesh Sinha, Ann Lee, Apoorv Vyas, Bowen Shi, Chih-Yao Ma, Ching-Yao Chuang, et al. Movie gen: A cast of media foundation models.arXiv preprint arXiv:2410.13720, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[46] [46]

Learning transferable visual models from natural language supervision

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. InInternational conference on machine learning, pages 8748–8763. PmLR, 2021

2021

[47] [47]

Robust speech recognition via large-scale weak supervision

Alec Radford, Jong Wook Kim, Tao Xu, Greg Brockman, Christine McLeavey, and Ilya Sutskever. Robust speech recognition via large-scale weak supervision. InInternational conference on machine learning, pages 28492–28518. PMLR, 2023

2023

[48] [48]

Seedance 2.0: Advancing Video Generation for World Complexity

Team Seedance, De Chen, Liyang Chen, Xin Chen, Ying Chen, Zhuo Chen, Zhuowei Chen, Feng Cheng, Tianheng Cheng, Yufeng Cheng, et al. Seedance 2.0: Advancing video generation for world complexity.arXiv preprint arXiv:2604.14148, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026

[49] [49]

Hunyuanvideo-foley: Multimodal diffusion with representation alignment for high- fidelity foley audio generation, 2025

Sizhe Shan, Qiulin Li, Yutao Cui, Miles Yang, Yuehai Wang, Qun Yang, Jin Zhou, and Zhao Zhong. Hunyuanvideo-foley: Multimodal diffusion with representation alignment for high- fidelity foley audio generation, 2025

2025

[50] [50]

Msvbench: Towards human-level evaluation of multi-shot video generation

Haoyuan Shi, Yunxin Li, Nanhao Deng, Zhenran Xu, Xinyu Chen, Longyue Wang, Baotian Hu, and Min Zhang. Msvbench: Towards human-level evaluation of multi-shot video generation. arXiv preprint arXiv:2602.23969, 2026

work page arXiv 2026

[51] [51]

SII-GAIR, Sand. ai, Ethan Chern, Hansi Teng, Hanwen Sun, Hao Wang, Hong Pan, Hongyu Jia, Jiadi Su, Jin Li, Junjie Yu, Lijie Liu, Lingzhi Li, Lyumanshan Ye, Min Hu, Qiangang Wang, Quanwei Qi, Steffi Chern, Tao Bu, Taoran Wang, Teren Xu, Tianning Zhang, Tiantian Mi, Weixian Xu, Wenqiang Zhang, Wentai Zhang, Xianping Yi, Xiaojie Cai, Xiaoyang Kang, Yan Ma, Y...

work page arXiv 2026

[52] [52]

Make-A-Video: Text-to-Video Generation without Text-Video Data

Uriel Singer, Adam Polyak, Thomas Hayes, Xi Yin, Jie An, Songyang Zhang, Qiyuan Hu, Harry Yang, Oron Ashual, Oran Gafni, et al. Make-a-video: Text-to-video generation without text-video data.arXiv preprint arXiv:2209.14792, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022

[53] [53]

arXiv preprint arXiv:2404.01292 (2024)

Gowthami Somepalli, Anubhav Gupta, Kamal Gupta, Shramay Palta, Micah Goldblum, Jonas Geiping, Abhinav Shrivastava, and Tom Goldstein. Measuring style similarity in diffusion models.arXiv preprint arXiv:2404.01292, 2024

work page arXiv 2024

[54] [54]

Transnet v2: An ef- fective deep network architecture for fast shot transition detection

Tomáš Souˇcek and Jakub Lokoˇc. Transnet v2: An effective deep network architecture for fast shot transition detection.arXiv preprint arXiv:2008.04838, 2020

work page arXiv 2008

[55] [55]

The proof and measurement of association between two things

Charles Spearman. The proof and measurement of association between two things. 1961

1961

[56] [56]

Mova: Towards scalable and synchronized video-audio generation.arXiv preprint arXiv:2602.08794, 2026

OpenMOSS Team, Donghua Yu, Mingshu Chen, Qi Chen, Qi Luo, Qianyi Wu, Qinyuan Cheng, Ruixiao Li, Tianyi Liang, Wenbo Zhang, et al. Mova: Towards scalable and synchronized video-audio generation.arXiv preprint arXiv:2602.08794, 2026

work page arXiv 2026

[57] [57]

Qwen3.5: Accelerating productivity with native multimodal agents, February 2026

Qwen Team. Qwen3.5: Accelerating productivity with native multimodal agents, February 2026. 13

2026

[58] [58]

Silero V AD: pre-trained enterprise-grade V oice Activity Detector (V AD), Number Detector and Language Classifier.https://github.com/snakers4/silero-vad, 2024

Silero Team. Silero V AD: pre-trained enterprise-grade V oice Activity Detector (V AD), Number Detector and Language Classifier.https://github.com/snakers4/silero-vad, 2024

2024

[59] [59]

Gemini 3.1 Pro

The Gemini Team. Gemini 3.1 Pro. https://blog.google/innovation-and-ai/ models-and-research/gemini-models/gemini-3-1-pro/, 2026

2026

[60] [60]

Meta audiobox aesthetics: Unified automatic quality assessment for speech, music, and sound

Andros Tjandra, Yi-Chiao Wu, Baishan Guo, John Hoffman, Brian Ellis, Apoorv Vyas, Bowen Shi, Sanyuan Chen, Matt Le, Nick Zacharov, Carleigh Wood, Ann Lee, and Wei-Ning Hsu. Meta audiobox aesthetics: Unified automatic quality assessment for speech, music, and sound. 2025

2025

[61] [61]

Wan2.7.https://www.wan27.xyz/, 2026

Tongyi Wanxiang Team. Wan2.7.https://www.wan27.xyz/, 2026

2026

[62] Team Wan, Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, Chen-Wei Xie, Di Chen, Feiwu Yu, Haiming Zhao, Jianxiao Yang, et al. Wan: Open and advanced large-scale video generative models. arXiv preprint arXiv:2503.20314, 2025.

[63] Kai Wang, Shijian Deng, Jing Shi, Dimitrios Hatzinakos, and Yapeng Tian. Av-dit: Efficient audio-visual diffusion transformer for joint audio and video generation. arXiv preprint arXiv:2406.07686, 2024.

[64] Wei Wang. Japanese Anime Scenes. https://www.kaggle.com/datasets/weiwangk/japanese-anime-scenes, 2023.

[65] Jianhui Wei, Xiaotian Zhang, Yichen Li, Yuan Wang, Yan Zhang, Ziyi Chen, Zhihang Tang, Wei Xu, and Zuozhu Liu. Univbench: Towards unified evaluation for video foundation models. arXiv preprint arXiv:2602.21835, 2026.

[66] Yujie Wei, Xinyu Liu, Shiwei Zhang, Hangjie Yuan, Jinbo Xing, Zhekai Chen, Xiang Wang, Haonan Qiu, Rui Zhao, Yutong Feng, et al. Dreamvideo-omni: Omni-motion controlled multi-subject video customization with latent identity reinforcement learning. arXiv preprint arXiv:2603.12257, 2026.

[67] Yujie Wei, Shiwei Zhang, Zhiwu Qing, Hangjie Yuan, Zhiheng Liu, Yu Liu, Yingya Zhang, Jingren Zhou, and Hongming Shan. Dreamvideo: Composing your dream videos with customized subject and motion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6537–6549, 2024.

[68] Yujie Wei, Shiwei Zhang, Hangjie Yuan, Biao Gong, Longxiang Tang, Xiang Wang, Haonan Qiu, Hengjia Li, Shuai Tan, Yingya Zhang, et al. Dreamrelation: Relation-centric video customization. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 12381–12393, 2025.

[69] Yujie Wei, Shiwei Zhang, Hangjie Yuan, Yujin Han, Zhekai Chen, Jiayu Wang, Difan Zou, Xihui Liu, Yingya Zhang, Yu Liu, et al. Routing matters in moe: Scaling diffusion transformers with explicit routing guidance. arXiv preprint arXiv:2510.24711, 2025.

[70] Yujie Wei, Shiwei Zhang, Hangjie Yuan, Xiang Wang, Haonan Qiu, Rui Zhao, Yutong Feng, Feng Liu, Zhizhong Huang, Jiaxin Ye, et al. Dreamvideo-2: Zero-shot subject-driven video customization with precise motion control. arXiv preprint arXiv:2410.13830, 2024.

[71] Tianxin Xie, Wentao Lei, Kai Jiang, Guanjie Huang, Pengfei Zhang, Chunhui Zhang, Fengji Ma, Haoyu He, Han Zhang, Jiangshan He, et al. Phyavbench: A challenging audio physics-sensitivity benchmark for physically grounded text-to-audio-video generation. arXiv preprint arXiv:2512.23994, 2025.

[72] Kaituo Xu, Yan Jia, Kai Huang, Junjie Chen, Wenpeng Li, Kun Liu, Feng-Long Xie, Xu Tang, and Yao Hu. Fireredasr2s: A state-of-the-art industrial-grade all-in-one automatic speech recognition system. arXiv preprint arXiv:2603.10420, 2026.

[73] Shuai Yang, Wei Huang, Ruihang Chu, Yicheng Xiao, Yuyang Zhao, Xianbang Wang, Muyang Li, Enze Xie, Yingcong Chen, Yao Lu, Song Han, and Yukang Chen. Longlive: Real-time interactive long video generation. 2025.

[74] Shenghai Yuan, Xianyi He, Yufan Deng, Yang Ye, Jinfa Huang, Bin Lin, Jiebo Luo, and Li Yuan. Opens2v-nexus: A detailed benchmark and million-scale dataset for subject-to-video generation. arXiv preprint arXiv:2505.20292, 2025.

[75] Shenghai Yuan, Yuanyang Yin, Zongjian Li, Xinwei Huang, Xiao Yang, and Li Yuan. Helios: Real real-time long video generation model. arXiv preprint arXiv:2603.04379, 2026.

[76] Lei Zhao, Linfeng Feng, Dongxu Ge, Rujin Chen, Fangqiu Yi, Chi Zhang, Xiao-Lei Zhang, and Xuelong Li. Uniform: A unified multi-task diffusion transformer for audio-video generation. arXiv preprint arXiv:2502.03897, 2025.

[77] Yang-Hao Zhou, Haitian Li, Rexar Lin, Heyan Huang, Jinxing Zhou, Changsen Yuan, Tian Lan, Ziqin Zhou, Yudong Li, Jiajun Xu, et al. Mtavg-bench: A comprehensive benchmark for evaluating multi-talker dialogue-centric audio-video generation. arXiv preprint arXiv:2602.00607, 2026.

[78] Ziwei Zhou, Zeyuan Lai, Rui Wang, Yifan Yang, Zhen Xing, Yuqing Yang, Qi Dai, Lili Qiu, and Chong Luo. Avgen-bench: A task-driven benchmark for multi-granular evaluation of text-to-audio-video generation. arXiv preprint arXiv:2604.08540, 2026.

[79] Hongzhou Zhu, Min Zhao, Guande He, Hang Su, Chongxuan Li, and Jun Zhu. Causal forcing: Autoregressive diffusion distillation done right for high-quality real-time interactive video generation, 2026.

[80] Cailin Zhuang, Ailin Huang, Yaoqi Hu, Jingwei Wu, Wei Cheng, Jiaqi Liao, Hongyuan Wang, Xinyao Liao, Weiwei Cai, Hengyuan Xu, et al. Vistorybench: Comprehensive benchmark suite for story visualization. arXiv preprint arXiv:2505.24862, 2025.