pith. machine review for the scientific record.

arxiv: 2605.12179 · v1 · submitted 2026-05-12 · 💻 cs.CV

Recognition: 2 theorem links


SyncDPO: Enhancing Temporal Synchronization in Video-Audio Joint Generation via Preference Learning

Authors on Pith: no claims yet

Pith reviewed 2026-05-13 06:55 UTC · model grok-4.3

classification 💻 cs.CV
keywords: video-audio generation · temporal synchronization · direct preference optimization (DPO) · curriculum learning · multimodal models · alignment

The pith

SyncDPO uses direct preference optimization with on-the-fly temporal distortions to improve video-audio synchronization.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Achieving precise timing between audio events and visual triggers in generated videos remains difficult even when semantic content matches well. Standard supervised fine-tuning relies on mean squared error losses that do not strongly discourage small timing errors. SyncDPO instead frames the problem as preference optimization by generating negative examples through rule-based distortions of the temporal alignment between video and audio. These negatives are created efficiently without extra sampling or human labeling, and training follows a curriculum from large to small misalignments. Objective and subjective tests on four benchmarks confirm gains in alignment quality and in handling unseen data distributions.
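
A minimal editorial sketch of the coarse-to-fine curriculum idea (not code from the paper; the stage boundaries and offset values are illustrative assumptions):

```python
# Editorial sketch: anneal the temporal offset used to build negative pairs from
# coarse to subtle as training progresses. All numbers are illustrative assumptions.

def max_negative_offset(step: int, total_steps: int,
                        coarse_sec: float = 2.0, fine_sec: float = 0.1) -> float:
    """Largest allowed audio-video offset (seconds) for negatives at this step."""
    progress = min(step / max(total_steps, 1), 1.0)
    return coarse_sec + progress * (fine_sec - coarse_sec)

# Early steps: grossly misaligned negatives that are easy to rank below the ground truth;
# late steps: negatives off by only ~100 ms, forcing fine-grained temporal sensitivity.
for step in (0, 5_000, 10_000):
    print(step, round(max_negative_offset(step, total_steps=10_000), 3))
```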

Core claim

SyncDPO is a post-training framework that leverages Direct Preference Optimization to enhance temporal sensitivity in video-audio joint generation models. It replaces costly preference pair construction with on-the-fly rule-based strategies that distort temporal structures, and employs curriculum learning to gradually increase the subtlety of misalignments. This approach yields models with improved temporal alignment and better out-of-distribution generalization compared to supervised fine-tuning baselines.
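
For readers unfamiliar with DPO, the shape of the objective is worth seeing. The sketch below is the standard DPO loss (Rafailov et al.) specialized to pairs whose loser is a temporally distorted copy of the ground-truth pair; the log-likelihood callables are placeholders, since for a diffusion or flow-based generator the paper would substitute an appropriate surrogate rather than exact likelihoods.

```python
import torch.nn.functional as F

# Editorial sketch of a DPO-style preference loss for video-audio pairs.
# `logp_model` and `logp_ref` are placeholder callables returning per-sample log-likelihood
# tensors under the trainable model and a frozen reference; the paper's exact adaptation to
# its generator (e.g., a denoising-loss surrogate) is not reproduced here.

def sync_dpo_loss(logp_model, logp_ref, aligned_pair, distorted_pair, beta: float = 0.1):
    # Winner = ground-truth aligned video-audio pair; loser = its rule-distorted copy.
    pi_w, pi_l = logp_model(aligned_pair), logp_model(distorted_pair)
    ref_w, ref_l = logp_ref(aligned_pair), logp_ref(distorted_pair)
    margin = beta * ((pi_w - ref_w) - (pi_l - ref_l))
    return -F.logsigmoid(margin).mean()
```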

What carries the argument

On-the-fly rule-based negative construction strategies for creating temporally misaligned video-audio pairs to serve as negatives in Direct Preference Optimization, supported by a progressive curriculum.
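
A sketch of what such rule-based distortions might look like in practice (editorial; the operator names, ranges, and the exact set of rules are assumptions, not the paper's specification):

```python
import numpy as np

# Editorial sketch: build a temporally misaligned negative by perturbing the audio track
# of a ground-truth pair while leaving the video untouched. All parameters are illustrative.

def shift(audio: np.ndarray, sr: int, offset_sec: float) -> np.ndarray:
    """Delay (positive offset) or advance (negative offset) the audio relative to the video."""
    n = int(abs(offset_sec) * sr)
    pad = np.zeros(n, dtype=audio.dtype)
    out = np.concatenate([pad, audio]) if offset_sec > 0 else np.concatenate([audio[n:], pad])
    return out[: len(audio)]

def stretch(audio: np.ndarray, factor: float) -> np.ndarray:
    """Naive resampling stretch: event timings drift steadily away from their visual triggers."""
    idx = np.clip((np.arange(len(audio)) * factor).astype(int), 0, len(audio) - 1)
    return audio[idx]

def drop(audio: np.ndarray, sr: int, start_sec: float, dur_sec: float) -> np.ndarray:
    """Silence a segment so an audio event no longer co-occurs with its visual trigger."""
    out = audio.copy()
    s, e = int(start_sec * sr), int((start_sec + dur_sec) * sr)
    out[s:e] = 0.0
    return out
```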

If this is right

  • The resulting models exhibit stronger temporal alignment on in-distribution benchmarks.
  • Generalization improves on out-of-distribution test sets by better capturing motion-sound relationships.
  • Training remains computationally efficient by avoiding separate sampling and ranking steps.
  • The method applies across varied domains including ambient sounds and speech videos.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar on-the-fly negative construction could accelerate preference tuning in other sequence alignment problems.
  • The curriculum design may offer a template for stabilizing preference optimization when negative quality varies.
  • If effective, this reduces reliance on human or model-based ranking for creating training signals in multimodal settings.

Load-bearing premise

The synthetic temporal distortions generated by the rules are representative enough of real misalignments to train the model effectively through preference comparisons.

What would settle it

Observing no statistically significant difference in temporal synchronization metrics between SyncDPO and baseline models when evaluated on the four benchmarks would indicate the approach does not deliver the claimed improvements.
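
One concrete form such a settling test could take (an editorial sketch, not the paper's protocol): collect a per-clip synchronization score for each model on the same clips and run a paired significance test. The scores below are synthetic placeholders standing in for whatever sync metric the evaluation actually uses.

```python
import numpy as np
from scipy.stats import wilcoxon

# Editorial sketch: paired comparison of per-clip synchronization scores on one benchmark.
# Synthetic placeholder scores; in a real test each entry would come from a sync metric
# (e.g., an off-the-shelf audio-visual synchronization estimator) on matched clips.
rng = np.random.default_rng(0)
baseline_scores = rng.normal(loc=0.60, scale=0.10, size=200)          # placeholder SFT baseline
syncdpo_scores = baseline_scores + rng.normal(0.05, 0.08, size=200)   # placeholder SyncDPO

stat, p = wilcoxon(syncdpo_scores, baseline_scores, alternative="greater")
print(f"Wilcoxon signed-rank p-value: {p:.4g}")
# Failing to reject the null on all four benchmarks would indicate the claimed
# synchronization gains are not statistically supported.
```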

Figures

Figures reproduced from arXiv: 2605.12179 by Kaisi Guan, Ruihua Song, Wenpu Li, Xihua Wang, Xin Cheng, Yinbo Wang, Ying Ba, Yuyue Wang.

Figure 1. A direct comparison of SFT, DPO, and SyncDPO paradigms. (a) shows the difference …
Figure 2. Visualization of on-the-fly negative construction for SyncDPO. The video or audio modality …
Figure 3. Human preference evaluation and qualitative comparison between SFT and SyncDPO.
Original abstract

Recent advancements in video-audio joint generation have achieved remarkable success in semantic correspondence. However, achieving precise temporal synchronization, which requires fine-grained alignment between audio events and their visual triggers, remains a challenging problem. The post-training method for joint generation is largely dominated by Supervised Fine-Tuning, but the commonly used Mean Squared Error loss provides insufficient penalties for subtle temporal misalignments. Direct Preference Optimization offers an alternative by introducing explicit misaligned counterparts to better improve temporal sensitivity. In this paper, we propose a post-training framework, SyncDPO, leveraging DPO to improve the temporal sensitivity of V-A joint generation. Conventional DPO pipelines typically depend on costly sampling-and-ranking procedures to construct preference pairs, resulting in substantial computational cost. To improve efficiency, we introduce a suite of on-the-fly rule-based negative construction strategies that distort temporal structures without incurring additional annotation or sampling. We demonstrate that the temporal alignment capability can be effectively reinforced by providing explicit negative supervision through temporally distorted V-A pairs. Accordingly, we implement a curriculum learning strategy that progressively increases the difficulty of negative samples, transitioning from coarse misalignment to subtle inconsistencies. Extensive objective and subjective experiments across four diverse benchmarks, ranging from ambient sound videos to human speech videos, demonstrate that SyncDPO significantly outperforms other methods in improving the model's temporal alignment capability. It also demonstrates superior generalization on an out-of-distribution benchmark by capturing intrinsic motion-sound dynamics. Demo and code are available at https://syncdpo.github.io/syncdpo/.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

1 major / 2 minor

Summary. The manuscript proposes SyncDPO, a post-training framework that applies Direct Preference Optimization (DPO) to video-audio joint generation models to improve temporal synchronization. It replaces costly sampling-and-ranking for preference pairs with on-the-fly rule-based negative construction strategies that apply temporal distortions such as shifts, stretches, and drops. A curriculum learning schedule progressively increases negative difficulty from coarse to subtle misalignments. The authors claim this yields superior temporal alignment on four benchmarks spanning ambient sound and human speech videos, plus better out-of-distribution generalization, with code and demos released.

Significance. If the results hold, the work offers an efficient route to post-train multimodal generators for fine-grained timing without extra annotation or heavy inference-time sampling. The explicit release of code and demo is a clear strength for reproducibility. The approach targets a genuine gap where semantic correspondence is already strong but precise event-level synchronization remains weak.

major comments (1)
  1. [Methods (negative construction strategies)] The central assumption that the on-the-fly rule-based negative construction strategies (temporal shifts, stretches, drops) produce misalignment distributions equivalent to those arising during actual generative sampling is load-bearing for the entire claim. The manuscript provides no analysis or ablation demonstrating that these synthetic distortions capture the correlated visual-audio artifacts typical of diffusion or autoregressive sampling errors, rather than merely varying difficulty inside an artificial family.
minor comments (2)
  1. [Abstract] The abstract asserts outperformance across benchmarks but omits concrete metric values, baseline names, and any mention of statistical tests or variance, which reduces immediate assessability of the empirical claims.
  2. Notation for the preference pairs and the curriculum schedule should be introduced with explicit equations or pseudocode to improve clarity for readers unfamiliar with the exact distortion operators.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for this valuable comment on our negative construction approach. We provide a detailed response below and will update the manuscript accordingly.

Point-by-point responses
  1. Referee: [Methods (negative construction strategies)] The central assumption that the on-the-fly rule-based negative construction strategies (temporal shifts, stretches, drops) produce misalignment distributions equivalent to those arising during actual generative sampling is load-bearing for the entire claim. The manuscript provides no analysis or ablation demonstrating that these synthetic distortions capture the correlated visual-audio artifacts typical of diffusion or autoregressive sampling errors, rather than merely varying difficulty inside an artificial family.

    Authors: We agree that a more rigorous validation of the negative construction strategies would bolster the paper. Our strategies are motivated by common temporal misalignment patterns seen in generated video-audio pairs, including those from diffusion models, such as audio lagging behind visual events or mismatched durations. Although we did not include a direct distributional comparison in the original submission, the superior performance on real benchmarks and OOD generalization suggest that the approach effectively targets relevant misalignment types. In the revised version, we will add an analysis section with examples of base model failures and how our negatives relate to them, along with an ablation on the curriculum stages. This will demonstrate that the distortions are not merely artificial but capture key aspects of the problem. revision: yes
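
As an editorial illustration of what the promised analysis could look like, one option is to compare the distribution of temporal offsets measured in actual base-model generations against the offsets injected by the rules. The numbers below are synthetic placeholders; in practice each offset would come from an audio-visual offset estimator run on real samples.

```python
import numpy as np
from scipy.stats import wasserstein_distance

# Editorial sketch: are rule-injected offsets distributed like the offsets the base model
# actually produces? Synthetic placeholder offsets stand in for estimator outputs.
rng = np.random.default_rng(0)
model_offsets_sec = rng.laplace(loc=0.0, scale=0.25, size=500)   # placeholder: measured on generations
rule_offsets_sec = rng.uniform(low=-0.5, high=0.5, size=500)     # placeholder: synthetic negatives

d = wasserstein_distance(model_offsets_sec, rule_offsets_sec)
print(f"Offset-distribution distance: {d:.3f} s")
# A small distance would support the claim that the rules cover realistic misalignments;
# a large one would back the referee's concern about an artificial negative family.
```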

Circularity Check

0 steps flagged

No significant circularity; builds on established DPO with novel rule-based negatives

Full rationale

The paper's derivation applies the standard DPO loss to video-audio pairs using explicitly defined on-the-fly rule-based distortions (temporal shifts, stretches, drops) for negative samples and a curriculum schedule. These construction rules are stated as independent of model outputs and do not reduce any claimed prediction to a fitted parameter or self-definition. No load-bearing self-citations, uniqueness theorems, or smuggled ansatzes appear in the method; the central claim of improved temporal sensitivity rests on the new negative-construction procedure rather than on prior author results. Experimental results are reported as external validation, not as part of the derivation chain itself.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Relies on standard DPO assumptions and the effectiveness of rule-based distortions as proxies for misalignment; no explicit free parameters or invented entities are detailed in the abstract.

axioms (1)
  • Domain assumption: Direct Preference Optimization improves sensitivity to temporal misalignments when provided with explicit negative pairs.
    Core premise drawn from prior DPO literature, applied here to video-audio generation.

pith-pipeline@v0.9.0 · 5585 in / 1139 out tokens · 68617 ms · 2026-05-13T06:55:32.946570+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
  • matches: the paper's claim is directly supported by a theorem in the formal canon.
  • supports: the theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: the paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: the paper appears to rely on the theorem as machinery.
  • contradicts: the paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
