MSAVBench is the first comprehensive benchmark for multi-shot audio-video generation, spanning video, audio, shot, and reference dimensions with an adaptive evaluation framework that reaches 91.5% Spearman correlation with human judgments.
Uniform: A unified multi-task diffusion transformer for audio-video generation
4 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
roles
background 2polarities
background 2representative citing papers
PhyAVBench provides the first systematic benchmark and metric for audio-physics grounding in T2AV, I2AV, and V2A models using controlled prompt pairs and real video ground truth.
SyncDPO improves temporal synchronization in video-audio joint generation using DPO with efficient on-the-fly negative sample construction and curriculum learning.
Unison introduces a unified framework using semantic-guided harmonization and bidirectional cross-modal forcing to generate human-centric videos with improved synchronization between motion, speech, and sound effects.
citing papers explorer
-
MSAVBench: Towards Comprehensive and Reliable Evaluation of Multi-Shot Audio-Video Generation
MSAVBench is the first comprehensive benchmark for multi-shot audio-video generation, spanning video, audio, shot, and reference dimensions with an adaptive evaluation framework that reaches 91.5% Spearman correlation with human judgments.
-
PhyAVBench: A Challenging Audio Physics-Sensitivity Benchmark for Physically Grounded Text-to-Audio-Video Generation
PhyAVBench provides the first systematic benchmark and metric for audio-physics grounding in T2AV, I2AV, and V2A models using controlled prompt pairs and real video ground truth.
-
SyncDPO: Enhancing Temporal Synchronization in Video-Audio Joint Generation via Preference Learning
SyncDPO improves temporal synchronization in video-audio joint generation using DPO with efficient on-the-fly negative sample construction and curriculum learning.
-
Unison: Harmonizing Motion, Speech, and Sound for Human-Centric Audio-Video Generation
Unison introduces a unified framework using semantic-guided harmonization and bidirectional cross-modal forcing to generate human-centric videos with improved synchronization between motion, speech, and sound effects.