pith. machine review for the scientific record.

arxiv: 2605.12179 · v1 · submitted 2026-05-12 · 💻 cs.CV

Recognition: 2 theorem links


SyncDPO: Enhancing Temporal Synchronization in Video-Audio Joint Generation via Preference Learning

Authors on Pith: no claims yet

Pith reviewed 2026-05-13 06:55 UTC · model grok-4.3

classification 💻 cs.CV
keywords: video-audio generation · temporal synchronization · direct preference optimization (DPO) · curriculum learning · multimodal models · alignment

The pith

SyncDPO uses direct preference optimization with on-the-fly temporal distortions to improve video-audio synchronization.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Achieving precise timing between audio events and visual triggers in generated videos remains difficult even when semantic content matches well. Standard supervised fine-tuning relies on mean squared error losses that do not strongly discourage small timing errors. SyncDPO instead frames the problem as preference optimization by generating negative examples through rule-based distortions of the temporal alignment between video and audio. These negatives are created efficiently without extra sampling or human labeling, and training follows a curriculum from large to small misalignments. Objective and subjective tests on four benchmarks confirm gains in alignment quality and in handling unseen data distributions.
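
A minimal editorial sketch of the coarse-to-fine curriculum idea (not code from the paper; the stage boundaries and offset values are illustrative assumptions):

```python
# Editorial sketch: anneal the temporal offset used to build negative pairs from
# coarse to subtle as training progresses. All numbers are illustrative assumptions.

def max_negative_offset(step: int, total_steps: int,
                        coarse_sec: float = 2.0, fine_sec: float = 0.1) -> float:
    """Largest allowed audio-video offset (seconds) for negatives at this step."""
    progress = min(step / max(total_steps, 1), 1.0)
    return coarse_sec + progress * (fine_sec - coarse_sec)

# Early steps: grossly misaligned negatives that are easy to rank below the ground truth;
# late steps: negatives off by only ~100 ms, forcing fine-grained temporal sensitivity.
for step in (0, 5_000, 10_000):
    print(step, round(max_negative_offset(step, total_steps=10_000), 3))
```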

Core claim

SyncDPO is a post-training framework that leverages Direct Preference Optimization to enhance temporal sensitivity in video-audio joint generation models. It replaces costly preference pair construction with on-the-fly rule-based strategies that distort temporal structures, and employs curriculum learning to gradually increase the subtlety of misalignments. This approach yields models with improved temporal alignment and better out-of-distribution generalization compared to supervised fine-tuning baselines.
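
For readers unfamiliar with DPO, the shape of the objective is worth seeing. The sketch below is the standard DPO loss (Rafailov et al.) specialized to pairs whose loser is a temporally distorted copy of the ground-truth pair; the log-likelihood callables are placeholders, since for a diffusion or flow-based generator the paper would substitute an appropriate surrogate rather than exact likelihoods.

```python
import torch.nn.functional as F

# Editorial sketch of a DPO-style preference loss for video-audio pairs.
# `logp_model` and `logp_ref` are placeholder callables returning per-sample log-likelihood
# tensors under the trainable model and a frozen reference; the paper's exact adaptation to
# its generator (e.g., a denoising-loss surrogate) is not reproduced here.

def sync_dpo_loss(logp_model, logp_ref, aligned_pair, distorted_pair, beta: float = 0.1):
    # Winner = ground-truth aligned video-audio pair; loser = its rule-distorted copy.
    pi_w, pi_l = logp_model(aligned_pair), logp_model(distorted_pair)
    ref_w, ref_l = logp_ref(aligned_pair), logp_ref(distorted_pair)
    margin = beta * ((pi_w - ref_w) - (pi_l - ref_l))
    return -F.logsigmoid(margin).mean()
```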

What carries the argument

On-the-fly rule-based negative construction strategies for creating temporally misaligned video-audio pairs to serve as negatives in Direct Preference Optimization, supported by a progressive curriculum.
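
A sketch of what such rule-based distortions might look like in practice (editorial; the operator names, ranges, and the exact set of rules are assumptions, not the paper's specification):

```python
import numpy as np

# Editorial sketch: build a temporally misaligned negative by perturbing the audio track
# of a ground-truth pair while leaving the video untouched. All parameters are illustrative.

def shift(audio: np.ndarray, sr: int, offset_sec: float) -> np.ndarray:
    """Delay (positive offset) or advance (negative offset) the audio relative to the video."""
    n = int(abs(offset_sec) * sr)
    pad = np.zeros(n, dtype=audio.dtype)
    out = np.concatenate([pad, audio]) if offset_sec > 0 else np.concatenate([audio[n:], pad])
    return out[: len(audio)]

def stretch(audio: np.ndarray, factor: float) -> np.ndarray:
    """Naive resampling stretch: event timings drift steadily away from their visual triggers."""
    idx = np.clip((np.arange(len(audio)) * factor).astype(int), 0, len(audio) - 1)
    return audio[idx]

def drop(audio: np.ndarray, sr: int, start_sec: float, dur_sec: float) -> np.ndarray:
    """Silence a segment so an audio event no longer co-occurs with its visual trigger."""
    out = audio.copy()
    s, e = int(start_sec * sr), int((start_sec + dur_sec) * sr)
    out[s:e] = 0.0
    return out
```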

If this is right

  • The resulting models exhibit stronger temporal alignment on in-distribution benchmarks.
  • Generalization improves on out-of-distribution test sets by better capturing motion-sound relationships.
  • Training remains computationally efficient by avoiding separate sampling and ranking steps.
  • The method applies across varied domains including ambient sounds and speech videos.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar on-the-fly negative construction could accelerate preference tuning in other sequence alignment problems.
  • The curriculum design may offer a template for stabilizing preference optimization when negative quality varies.
  • If effective, this reduces reliance on human or model-based ranking for creating training signals in multimodal settings.

Load-bearing premise

The synthetic temporal distortions generated by the rules are representative enough of real misalignments to train the model effectively through preference comparisons.

What would settle it

Observing no statistically significant difference in temporal synchronization metrics between SyncDPO and baseline models when evaluated on the four benchmarks would indicate the approach does not deliver the claimed improvements.
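
One concrete form such a settling test could take (an editorial sketch, not the paper's protocol): collect a per-clip synchronization score for each model on the same clips and run a paired significance test. The scores below are synthetic placeholders standing in for whatever sync metric the evaluation actually uses.

```python
import numpy as np
from scipy.stats import wilcoxon

# Editorial sketch: paired comparison of per-clip synchronization scores on one benchmark.
# Synthetic placeholder scores; in a real test each entry would come from a sync metric
# (e.g., an off-the-shelf audio-visual synchronization estimator) on matched clips.
rng = np.random.default_rng(0)
baseline_scores = rng.normal(loc=0.60, scale=0.10, size=200)          # placeholder SFT baseline
syncdpo_scores = baseline_scores + rng.normal(0.05, 0.08, size=200)   # placeholder SyncDPO

stat, p = wilcoxon(syncdpo_scores, baseline_scores, alternative="greater")
print(f"Wilcoxon signed-rank p-value: {p:.4g}")
# Failing to reject the null on all four benchmarks would indicate the claimed
# synchronization gains are not statistically supported.
```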

Figures

Figures reproduced from arXiv: 2605.12179 by Kaisi Guan, Ruihua Song, Wenpu Li, Xihua Wang, Xin Cheng, Yinbo Wang, Ying Ba, Yuyue Wang.

Figure 1. A direct comparison of SFT, DPO, and SyncDPO paradigms. (a) shows the difference …
Figure 2. Visualization of on-the-fly negative construction for SyncDPO. The video or audio modality …
Figure 3. Human preference evaluation and qualitative comparison between SFT and SyncDPO.
Original abstract

Recent advancements in video-audio joint generation have achieved remarkable success in semantic correspondence. However, achieving precise temporal synchronization, which requires fine-grained alignment between audio events and their visual triggers, remains a challenging problem. The post-training method for joint generation is largely dominated by Supervised Fine-Tuning, but the commonly used Mean Squared Error loss provides insufficient penalties for subtle temporal misalignments. Direct Preference Optimization offers an alternative by introducing explicit misaligned counterparts to better improve temporal sensitivity. In this paper, we propose a post-training framework, SyncDPO, leveraging DPO to improve the temporal sensitivity of V-A joint generation. Conventional DPO pipelines typically depend on costly sampling-and-ranking procedures to construct preference pairs, resulting in substantial computational cost. To improve efficiency, we introduce a suite of on-the-fly rule-based negative construction strategies that distort temporal structures without incurring additional annotation or sampling. We demonstrate that the temporal alignment capability can be effectively reinforced by providing explicit negative supervision through temporally distorted V-A pairs. Accordingly, we implement a curriculum learning strategy that progressively increases the difficulty of negative samples, transitioning from coarse misalignment to subtle inconsistencies. Extensive objective and subjective experiments across four diverse benchmarks, ranging from ambient sound videos to human speech videos, demonstrate that SyncDPO significantly outperforms other methods in improving the model's temporal alignment capability. It also demonstrates superior generalization on an out-of-distribution benchmark by capturing intrinsic motion-sound dynamics. Demo and code are available at https://syncdpo.github.io/syncdpo/.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

1 major / 2 minor

Summary. The manuscript proposes SyncDPO, a post-training framework that applies Direct Preference Optimization (DPO) to video-audio joint generation models to improve temporal synchronization. It replaces costly sampling-and-ranking for preference pairs with on-the-fly rule-based negative construction strategies that apply temporal distortions such as shifts, stretches, and drops. A curriculum learning schedule progressively increases negative difficulty from coarse to subtle misalignments. The authors claim this yields superior temporal alignment on four benchmarks spanning ambient sound and human speech videos, plus better out-of-distribution generalization, with code and demos released.

Significance. If the results hold, the work offers an efficient route to post-train multimodal generators for fine-grained timing without extra annotation or heavy inference-time sampling. The explicit release of code and demo is a clear strength for reproducibility. The approach targets a genuine gap where semantic correspondence is already strong but precise event-level synchronization remains weak.

major comments (1)
  1. [Methods (negative construction strategies)] The central assumption that the on-the-fly rule-based negative construction strategies (temporal shifts, stretches, drops) produce misalignment distributions equivalent to those arising during actual generative sampling is load-bearing for the entire claim. The manuscript provides no analysis or ablation demonstrating that these synthetic distortions capture the correlated visual-audio artifacts typical of diffusion or autoregressive sampling errors, rather than merely varying difficulty inside an artificial family.
minor comments (2)
  1. [Abstract] The abstract asserts outperformance across benchmarks but omits concrete metric values, baseline names, and any mention of statistical tests or variance, which reduces immediate assessability of the empirical claims.
  2. Notation for the preference pairs and the curriculum schedule should be introduced with explicit equations or pseudocode to improve clarity for readers unfamiliar with the exact distortion operators.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for this valuable comment on our negative construction approach. We provide a detailed response below and will update the manuscript accordingly.

Point-by-point responses
  1. Referee: [Methods (negative construction strategies)] The central assumption that the on-the-fly rule-based negative construction strategies (temporal shifts, stretches, drops) produce misalignment distributions equivalent to those arising during actual generative sampling is load-bearing for the entire claim. The manuscript provides no analysis or ablation demonstrating that these synthetic distortions capture the correlated visual-audio artifacts typical of diffusion or autoregressive sampling errors, rather than merely varying difficulty inside an artificial family.

    Authors: We agree that a more rigorous validation of the negative construction strategies would bolster the paper. Our strategies are motivated by common temporal misalignment patterns seen in generated video-audio pairs, including those from diffusion models, such as audio lagging behind visual events or mismatched durations. Although we did not include a direct distributional comparison in the original submission, the superior performance on real benchmarks and OOD generalization suggest that the approach effectively targets relevant misalignment types. In the revised version, we will add an analysis section with examples of base model failures and how our negatives relate to them, along with an ablation on the curriculum stages. This will demonstrate that the distortions are not merely artificial but capture key aspects of the problem. revision: yes
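
As an editorial illustration of what the promised analysis could look like, one option is to compare the distribution of temporal offsets measured in actual base-model generations against the offsets injected by the rules. The numbers below are synthetic placeholders; in practice each offset would come from an audio-visual offset estimator run on real samples.

```python
import numpy as np
from scipy.stats import wasserstein_distance

# Editorial sketch: are rule-injected offsets distributed like the offsets the base model
# actually produces? Synthetic placeholder offsets stand in for estimator outputs.
rng = np.random.default_rng(0)
model_offsets_sec = rng.laplace(loc=0.0, scale=0.25, size=500)   # placeholder: measured on generations
rule_offsets_sec = rng.uniform(low=-0.5, high=0.5, size=500)     # placeholder: synthetic negatives

d = wasserstein_distance(model_offsets_sec, rule_offsets_sec)
print(f"Offset-distribution distance: {d:.3f} s")
# A small distance would support the claim that the rules cover realistic misalignments;
# a large one would back the referee's concern about an artificial negative family.
```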

Circularity Check

0 steps flagged

No significant circularity; builds on established DPO with novel rule-based negatives

Full rationale

The paper's derivation applies the standard DPO loss to video-audio pairs using explicitly defined on-the-fly rule-based distortions (temporal shifts, stretches, drops) for negative samples and a curriculum schedule. These construction rules are stated as independent of model outputs and do not reduce any claimed prediction to a fitted parameter or self-definition. No load-bearing self-citations, uniqueness theorems, or smuggled ansatzes appear in the method; the central claim of improved temporal sensitivity rests on the new negative-construction procedure rather than on prior author results. Experimental results are reported as external validation, not as part of the derivation chain itself.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Relies on standard DPO assumptions and the effectiveness of rule-based distortions as proxies for misalignment; no explicit free parameters or invented entities are detailed in the abstract.

axioms (1)
  • Domain assumption: Direct Preference Optimization improves sensitivity to temporal misalignments when provided with explicit negative pairs.
    Core premise drawn from prior DPO literature, applied here to video-audio generation.

pith-pipeline@v0.9.0 · 5585 in / 1139 out tokens · 68617 ms · 2026-05-13T06:55:32.946570+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
  • matches: the paper's claim is directly supported by a theorem in the formal canon.
  • supports: the theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: the paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: the paper appears to rely on the theorem as machinery.
  • contradicts: the paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
