V2M-Zero: Zero-Pair Time-Aligned Video-to-Music Generation
Pith reviewed 2026-05-15 13:15 UTC · model grok-4.3
The pith
Event curves from intra-modal similarities enable zero-pair training for time-aligned video-to-music generation.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We introduce V2M-ZERO, a video-to-music generation approach that generates time-aligned music with disentangled time synchronization and semantic control from video while requiring zero video-music pairs at training time. Temporal synchronization requires matching when and how much change occurs, not what changes. Shared temporal structure is captured independently within each modality through event curves computed from intra-modal similarity using pretrained music and video encoders. These curves provide comparable representations across modalities, enabling a training strategy of fine-tuning a text-to-music model on music-event curves and substituting video-event curves at inference. This
What carries the argument
Event curves computed from intra-modal similarity using pretrained encoders, which capture the timing and magnitude of changes independently per modality to allow direct substitution for alignment.
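To make the idea concrete, here is a minimal sketch of how such a curve might be computed from a sequence of frame-level embeddings. It is an illustration under stated assumptions, not the paper's implementation: the function name `event_curve`, the use of cosine similarity, and the choice to read change off consecutive-frame similarity are all hypothetical, and the paper's exact formulation may differ.

```python
import numpy as np

def event_curve(embeddings: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Hypothetical event curve: per-step change read from a self-similarity matrix.

    embeddings: array of shape (T, D), one embedding per time step, taken from
    any pretrained encoder (video or music). Assumed input, not the paper's API.
    """
    # L2-normalize each embedding so dot products become cosine similarities.
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    unit = embeddings / np.maximum(norms, eps)
    sim = unit @ unit.T  # (T, T) temporal self-similarity matrix

    # Change between consecutive steps: 1 - similarity on the first off-diagonal.
    change = np.clip(1.0 - np.diagonal(sim, offset=1), 0.0, None)
    change = np.concatenate([[0.0], change])  # no change is defined before step 0

    # Scale to [0, 1] by the largest change within this sequence.
    peak = change.max()
    return change / peak if peak > eps else np.zeros_like(change)

# Toy sequence: five identical frames, then an abrupt switch to a different one.
emb = np.vstack([np.tile([1.0, 0.0], (5, 1)), np.tile([0.0, 1.0], (5, 1))])
curve = event_curve(emb)
```

Because the function only looks at similarities within one sequence, the same code applies to either modality; whether the resulting curves are truly comparable across modalities is exactly the premise the review questions.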
If this is right
- Surpasses prior baselines with 5-9% higher audio quality, 13-15% better semantic alignment, and 21-52% improved temporal synchronization without any paired data.
- Achieves 28% higher beat alignment on dance videos from the AIST++ benchmark.
- Delivers comparable gains in large-scale crowd-sourced subjective listening tests.
- Enables separate controls for timing via event curves and for style via text prompts such as genre or mood.
- Validates that within-modality temporal features outperform paired cross-modal supervision for this task.
Where Pith is reading between the lines
- The intra-modal curve approach could transfer to other cross-modal tasks where paired data is limited, such as aligning generated audio to text descriptions or images.
- Treating temporal structure as modality-independent may lower data requirements for training multimodal generators in domains like film scoring or dance music synthesis.
- If the substitution works reliably, it opens a path to iterative refinement where timing curves are edited independently of semantic prompts.
Load-bearing premise
Event curves from intra-modal similarities using pretrained encoders provide comparable representations across video and music modalities that can be directly substituted at inference without cross-modal training.
What would settle it
On benchmarks such as OES-Pub or MovieGenBench-Music, the claim is falsified if models using substituted video-event curves score no better than, or worse than, the strongest paired cross-modal baselines on temporal synchronization and beat alignment.
Original abstract
Generating music that temporally aligns with video events is challenging for existing text-to-music models, which lack fine-grained temporal control. We introduce V2M-ZERO, a video-to-music generation approach that generates time-aligned music with disentangled time synchronization and semantic control (e.g., genre, mood) from video while requiring zero video-music pairs at training time. Our method is motivated by a key observation: temporal synchronization requires matching when and how much change occurs, not what changes. While musical and visual events differ semantically, they exhibit shared temporal structure that can be captured independently within each modality. We capture this structure through event curves computed from intra-modal similarity using pretrained music and video encoders. By measuring temporal change within each modality independently, these curves provide comparable representations across modalities. This enables a simple training strategy: fine-tune a text-to-music model on music-event curves, then substitute video-event curves at inference without cross-modal training or paired data. Across OES-Pub, MovieGenBench-Music, and AIST++, V2M-ZERO achieves state-of-the-art performance without any paired music-video data, surpassing the strongest prior baselines per metric with 5-9% higher audio quality, 13-15% better semantic alignment, 21-52% improved temporal synchronization, and 28% higher beat alignment on dance videos. We find similar results via a large crowd-sourced subjective listening test. Our results validate that temporal alignment through within-modality features is not only effective for video-to-music generation but also leads to better performance than paired cross-modal supervision. Furthermore, our approach enables independent controls for timing and music style (e.g., genre, mood) for more controllable generation.
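Substituting a video-event curve for a music-event curve at inference presumably requires both to share the time resolution of the generator's conditioning signal. The abstract does not say how this is handled, so the sketch below simply assumes linear interpolation onto a target length; `resample_curve` and the example lengths are hypothetical.

```python
import numpy as np

def resample_curve(curve: np.ndarray, target_len: int) -> np.ndarray:
    """Linearly resample a 1-D event curve to target_len samples (assumed step)."""
    # Place source and target samples on a shared normalized time axis [0, 1].
    src_t = np.linspace(0.0, 1.0, num=len(curve))
    dst_t = np.linspace(0.0, 1.0, num=target_len)
    return np.interp(dst_t, src_t, curve)

# Toy video-event curve with 8 steps, resampled to 15 conditioning steps.
video_curve = np.array([0.0, 0.1, 0.9, 0.2, 0.0, 0.0, 1.0, 0.1])
resampled = resample_curve(video_curve, 15)
```

Linear interpolation keeps values inside the original range, so a curve scaled to [0, 1] stays in [0, 1]; it can flatten sharp peaks when downsampling, which is one reason a real pipeline might choose a different scheme.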
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper presents V2M-Zero, a zero-pair method for video-to-music generation that achieves temporal alignment by computing event curves independently from intra-modal similarities using pretrained video and music encoders. It fine-tunes a text-to-music model on music event curves only, then substitutes video event curves at inference to control timing while allowing separate semantic control (e.g., genre, mood). The approach reports state-of-the-art results across OES-Pub, MovieGenBench-Music, and AIST++ with gains of 5-9% in audio quality, 13-15% in semantic alignment, 21-52% in temporal synchronization, and 28% in beat alignment, plus subjective validation, claiming superiority over paired cross-modal supervision.
Significance. If the central claim holds, the work would be significant for showing that within-modality temporal structures can substitute for paired data in cross-modal alignment tasks, enabling more controllable generation and reducing dependence on expensive paired datasets. This has potential implications for zero-shot multimodal synthesis in computer vision and audio, provided the event-curve equivalence is validated.
major comments (1)
- [Method (event curve substitution)] The claim, stated in both the abstract and the method section, that event curves from pretrained video and music encoders provide comparable representations for direct substitution rests on an unverified assumption of shared temporal structure: the paper reports no normalization, moment-matching, distributional comparison, or dynamic-range analysis between modalities. This assumption is load-bearing for the zero-pair training strategy and the reported temporal gains (21-52%); without an account of domain shift, the improvements could instead be attributable to model robustness.
minor comments (2)
- [Abstract and results] Exact baselines, statistical significance tests, and controls for data leakage are not detailed for the reported metric improvements across the three benchmarks.
- [Subjective evaluation] Additional specifics on the crowd-sourced listening test (participant count, rating scales, and statistical analysis) would strengthen the validation claims.
Simulated Author's Rebuttal
We thank the referee for their thoughtful review and for identifying a key methodological assumption in V2M-Zero. We address the concern point-by-point below and will incorporate additional analysis in the revision.
- Referee: The claim that event curves from pretrained video and music encoders provide comparable representations for direct substitution rests on an unverified assumption of shared temporal structure: the paper reports no normalization, moment-matching, distributional comparison, or dynamic-range analysis between modalities. This assumption is load-bearing for the zero-pair training strategy and the reported temporal gains (21-52%); without an account of domain shift, the improvements could instead be attributable to model robustness.
Authors: We agree that explicit verification strengthens the central claim. Event curves are computed identically within each modality: given a sequence of embeddings from a pretrained encoder, we form a temporal similarity matrix and derive a normalized change curve (scaled to [0,1] by the maximum intra-sequence difference). This formulation ensures that both curves measure the same quantity (the magnitude and timing of change), independent of semantic content. While the original submission did not include side-by-side distributional plots, the 21-52% temporal synchronization gains over paired baselines across three datasets would be unlikely if domain shift were dominant; such a shift would typically harm rather than enhance alignment. In the revised manuscript we will add a dedicated subsection with (i) normalization details, (ii) histogram comparisons of video vs. music curve values, and (iii) dynamic-range statistics, confirming the curves occupy statistically similar ranges.
revision: yes
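The checks promised in the rebuttal (histogram comparison and dynamic-range statistics) could look roughly like the sketch below. Everything here is an assumption for illustration: the helper names `curve_stats` and `histogram_overlap`, the 20-bin histogram, the 5th/95th percentiles as a dynamic-range proxy, and the toy curve values are not taken from the paper.

```python
import numpy as np

def curve_stats(curve: np.ndarray) -> dict:
    """Summary statistics for one event curve with values in [0, 1]."""
    return {
        "mean": float(np.mean(curve)),
        "std": float(np.std(curve)),
        "p05": float(np.percentile(curve, 5)),   # lower end of the dynamic range
        "p95": float(np.percentile(curve, 95)),  # upper end of the dynamic range
    }

def histogram_overlap(a: np.ndarray, b: np.ndarray, bins: int = 20) -> float:
    """Histogram intersection in [0, 1]: 1.0 means identical binned distributions."""
    edges = np.linspace(0.0, 1.0, bins + 1)  # shared bins so counts are comparable
    ha, _ = np.histogram(a, bins=edges)
    hb, _ = np.histogram(b, bins=edges)
    pa = ha / max(ha.sum(), 1)  # normalize counts to probabilities
    pb = hb / max(hb.sum(), 1)
    return float(np.minimum(pa, pb).sum())

# Toy curves standing in for pooled music-event and video-event values.
music_vals = np.array([0.02, 0.05, 0.10, 0.40, 0.90, 1.00, 0.07, 0.03])
video_vals = np.array([0.01, 0.06, 0.12, 0.35, 0.85, 1.00, 0.08, 0.04])
overlap = histogram_overlap(music_vals, video_vals)
stats_music, stats_video = curve_stats(music_vals), curve_stats(video_vals)
```

A low overlap or sharply different percentiles would point to the domain shift the referee raises; a high overlap would support, though not prove, the substitution premise.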
Circularity Check
No significant circularity; derivation is self-contained via independent intra-modal computations
The paper computes event curves separately within each modality from pretrained encoders, fine-tunes the text-to-music model exclusively on music-derived curves, and substitutes video curves at inference under the explicit assumption of shared temporal structure. No equation reduces the alignment metric or generated output to a fitted parameter defined from the same data or to a self-citation chain. The central zero-pair claim rests on empirical validation across separate benchmarks rather than on any definitional equivalence or imported uniqueness theorem. This is the expected non-finding for a method whose core steps remain externally verifiable.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Event curves derived from intra-modal similarity in pretrained video and music encoders capture comparable temporal structure across modalities.
invented entities (1)
- event curves: no independent evidence