V2M-Zero: Zero-Pair Time-Aligned Video-to-Music Generation

Aniruddha Mahapatra; Gedas Bertasius; Jonah Casebeer; Long Mai; Nicholas J. Bryan; Yan-Bo Lin

arXiv: 2603.11042 · v2 · pith:UMFKTNZTnew · submitted 2026-03-11 · 💻 cs.CV · cs.AI · cs.LG · cs.MM · cs.SD

Pith reviewed 2026-05-15 13:15 UTC · model grok-4.3

keywords: video-to-music generation · zero-pair learning · temporal alignment · event curves · intra-modal similarity · disentangled control · text-to-music fine-tuning

The pith

Event curves from intra-modal similarities enable zero-pair training for time-aligned video-to-music generation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that temporal alignment between video and generated music can be achieved by computing event curves separately within each modality from pretrained encoders, then substituting video curves into a music model at inference. This avoids any requirement for paired video-music training data while disentangling timing control from semantic factors such as genre or mood. A reader would care because paired cross-modal datasets are costly and scarce, and the reported results show higher performance than paired baselines across objective metrics and human listening tests. The core insight is that synchronization depends on matching the timing and magnitude of changes rather than their semantic content.

Core claim

We introduce V2M-ZERO, a video-to-music generation approach that generates time-aligned music with disentangled time synchronization and semantic control from video while requiring zero video-music pairs at training time. Temporal synchronization requires matching when and how much change occurs, not what changes. Shared temporal structure is captured independently within each modality through event curves computed from intra-modal similarity using pretrained music and video encoders. These curves provide comparable representations across modalities, enabling a training strategy of fine-tuning a text-to-music model on music-event curves and substituting video-event curves at inference. This

What carries the argument

Event curves computed from intra-modal similarity using pretrained encoders, which capture the timing and magnitude of changes independently per modality to allow direct substitution for alignment.

If this is right

Surpasses prior baselines with 5-9% higher audio quality, 13-15% better semantic alignment, and 21-52% improved temporal synchronization without any paired data.
Achieves 28% higher beat alignment on dance videos from the AIST++ benchmark.
Delivers comparable gains in large-scale crowd-sourced subjective listening tests.
Enables separate controls for timing via event curves and for style via text prompts such as genre or mood.
Validates that within-modality temporal features outperform paired cross-modal supervision for this task.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The intra-modal curve approach could transfer to other cross-modal tasks where paired data is limited, such as aligning generated audio to text descriptions or images.
Treating temporal structure as modality-independent may lower data requirements for training multimodal generators in domains like film scoring or dance music synthesis.
If the substitution works reliably, it opens a path to iterative refinement where timing curves are edited independently of semantic prompts.

Load-bearing premise

Event curves from intra-modal similarities using pretrained encoders provide comparable representations across video and music modalities that can be directly substituted at inference without cross-modal training.

What would settle it

On benchmarks such as OES-Pub or MovieGenBench-Music, if models using substituted video event curves show no improvement or worse temporal synchronization and beat alignment metrics than the strongest paired cross-modal baselines, the claim is falsified.

Figures

Figures reproduced from arXiv: 2603.11042 by Aniruddha Mahapatra, Gedas Bertasius, Jonah Casebeer, Long Mai, Nicholas J. Bryan, Yan-Bo Lin.

**Figure 1.** Zero-Pair Video-to-Music Generation. Top: Generating music for video commonly requires large-scale collections of high-quality, paired video-music data. Middle: Our V2M-Zero method is trained only on text–music pairs with an additional music-event curve condition (no video). Bottom: At inference, we swap a music-event curve with aligned video-event curves extracted via off-the-shelf vision models and gener…

**Figure 2.** Shared Temporal Structure Across Modalities. Real event curves computed from video and music exhibit similar temporal patterns across diverse video scenarios. Ground-truth pairs have correlation ≈ 0.6; introducing random offsets degrades this to ≈ 0.2. In practice, video and music synchronization often corresponds to (sparse) moments of interest or events over time (e.g., video events of dancing and s…

**Figure 3.** Method Overview. Top: During training, V2M-Zero learns a rectified-flow diffusion process conditioned on text prompts and a music-event curve derived from intra-music similarity. Bottom: At inference, music conditioning is swapped with a video-event curve based on framewise similarity, enabling zero-pair, time-synchronized video-to-music generation. For semantic alignment, a text prompt is predicted from th…

**Figure 4.** Impact of Smoothing Kernel Size. Larger kernels improve audio quality (FAD*), but temporal alignment (SCH) has an optimal point on OES-Pub. We systematically ablate four design axes: (i) kernel size for modality gap mitigation, (ii) encoders for event-curve extraction, (iii) domain-specific visual encoders, and (iv) LLM selection for prompt generation. Mitigating Modality Gap. Music-event curves (training…

**Figure 5.** Example event curves with different temporal dynamics.
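
Two ideas from the captions above lend themselves to a small sketch: the drop in correlation when curves are offset (Figure 2) and kernel smoothing of a curve (Figure 4). The code below is illustrative only. It assumes a box (moving-average) kernel and a circular shift for the offset, neither of which is specified in the captions; all names and toy curves are hypothetical, and the numbers it produces are not the paper's.

```python
import math


def moving_average(curve, kernel_size):
    """Box-filter smoothing; windows are truncated at the sequence edges."""
    half = kernel_size // 2
    n = len(curve)
    smoothed = []
    for i in range(n):
        window = curve[max(0, i - half):min(n, i + half + 1)]
        smoothed.append(sum(window) / len(window))
    return smoothed


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0.0 or var_y == 0.0:
        return 0.0  # a constant curve carries no timing information
    return cov / math.sqrt(var_x * var_y)


def shift(curve, offset):
    """Circularly shift a curve to the right by `offset` steps."""
    offset %= len(curve)
    return curve[-offset:] + curve[:-offset]


# Toy curves: events at the same time steps, with different magnitudes.
music = [0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 1.0]
video = [0.0, 0.1, 0.9, 0.2, 0.0, 0.0, 0.7, 0.1, 0.0, 0.0, 0.1, 1.0]

aligned = pearson(video, music)
misaligned = pearson(shift(video, 2), music)
smoothed_video = moving_average(video, 5)

print(f"aligned: {aligned:.2f}, offset by 2 steps: {misaligned:.2f}")
```

Smoothing lowers and widens each peak, which is one plausible reason a larger kernel could trade timing precision for robustness; the caption reports the trade-off but this sketch does not reproduce it.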

Abstract

Generating music that temporally aligns with video events is challenging for existing text-to-music models, which lack fine-grained temporal control. We introduce V2M-ZERO, a video-to-music generation approach that generates time-aligned music with disentangled time synchronization and semantic control (e.g., genre, mood) from video while requiring zero video-music pairs at training time. Our method is motivated by a key observation: temporal synchronization requires matching when and how much change occurs, not what changes. While musical and visual events differ semantically, they exhibit shared temporal structure that can be captured independently within each modality. We capture this structure through event curves computed from intra-modal similarity using pretrained music and video encoders. By measuring temporal change within each modality independently, these curves provide comparable representations across modalities. This enables a simple training strategy: fine-tune a text-to-music model on music-event curves, then substitute video-event curves at inference without cross-modal training or paired data. Across OES-Pub, MovieGenBench-Music, and AIST++, V2M-ZERO achieves state-of-the-art performance without any paired music-video data, surpassing the strongest prior baselines per metric with 5-9% higher audio quality, 13-15% better semantic alignment, 21-52% improved temporal synchronization, and 28% higher beat alignment on dance videos. We find similar results via a large crowd-source subjective listening test. Our results validate that temporal alignment through within-modality features is not only effective for video-to-music generation but also leads to better performance than paired cross-modal supervision. Furthermore, our approach enables independent controls for timing and music style (e.g., genre, mood) for more controllable generation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

V2M-Zero gets time-aligned video music without pairs by swapping intra-modal event curves into a fine-tuned text-to-music model, and the numbers look decent, but the curves' direct comparability is assumed rather than shown.

The main contribution is the zero-pair training trick: derive event curves from a pretrained music encoder, fine-tune a text-to-music model on those curves, then drop in video event curves from a separate video encoder at inference. This sidesteps paired data while claiming better temporal sync than paired baselines. The reported gains are 5-9% audio quality, 13-15% semantic alignment, 21-52% temporal synchronization, and 28% beat alignment on dance clips across OES-Pub, MovieGenBench-Music, and AIST++, backed by a crowd-sourced listening test. The disentangled timing versus style control is a practical plus for creative tools.

Referee Report

1 major / 2 minor

Summary. The paper presents V2M-Zero, a zero-pair method for video-to-music generation that achieves temporal alignment by computing event curves independently from intra-modal similarities using pretrained video and music encoders. It fine-tunes a text-to-music model on music event curves only, then substitutes video event curves at inference to control timing while allowing separate semantic control (e.g., genre, mood). The approach reports state-of-the-art results across OES-Pub, MovieGenBench-Music, and AIST++ with gains of 5-9% in audio quality, 13-15% in semantic alignment, 21-52% in temporal synchronization, and 28% in beat alignment, plus subjective validation, claiming superiority over paired cross-modal supervision.

Significance. If the central claim holds, the work would be significant for showing that within-modality temporal structures can substitute for paired data in cross-modal alignment tasks, enabling more controllable generation and reducing dependence on expensive paired datasets. This has potential implications for zero-shot multimodal synthesis in computer vision and audio, provided the event-curve equivalence is validated.

major comments (1)

[Method (event curve substitution), as described in the abstract and method] The claim that event curves from pretrained video and music encoders provide comparable representations for direct substitution rests on the unverified assumption of shared temporal structure, with no reported normalization, moment-matching, distributional comparison, or dynamic-range analysis between modalities. This is load-bearing for the zero-pair training strategy and the reported temporal gains (21-52%); if domain shift is unaccounted for, the improvements could instead reflect model robustness.
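
One of the checks named in this comment, moment matching, could look like the sketch below. This is a hypothetical diagnostic, not something the paper reports: it affinely rescales a video curve so that its mean and standard deviation match those of a reference music curve, then clips to [0, 1]. The function names and toy curves are assumptions for illustration.

```python
import math


def mean_std(values):
    """Population mean and standard deviation."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n
    return mean, math.sqrt(var)


def clip01(value):
    """Clamp a value to the [0, 1] range."""
    return min(1.0, max(0.0, value))


def match_moments(source, target_mean, target_std):
    """Rescale `source` toward the target mean/std, then clip to [0, 1]."""
    src_mean, src_std = mean_std(source)
    if src_std == 0.0:
        return [clip01(target_mean)] * len(source)  # flat curve stays flat
    scale = target_std / src_std
    # Note: clipping afterwards can move the final moments slightly.
    return [clip01((v - src_mean) * scale + target_mean) for v in source]


# Toy curves: same event timing, but the video curve is compressed in range.
music_curve = [0.0, 0.2, 0.1, 0.9, 0.1, 0.0, 0.8, 0.1]
video_curve = [0.4, 0.5, 0.45, 0.7, 0.45, 0.4, 0.65, 0.45]

music_mean, music_std = mean_std(music_curve)
matched = match_moments(video_curve, music_mean, music_std)
print([round(v, 3) for v in matched])
```

Because the rescaling is affine with a positive scale, peak locations are unchanged and only magnitudes move. Comparing generation quality with and without such a step would be one way to probe how much the dynamic-range gap matters.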

minor comments (2)

[Abstract and results] Exact baselines, statistical significance tests, and controls for data leakage are not detailed for the reported metric improvements across the three benchmarks.
[Subjective evaluation] Additional specifics on the crowd-sourced listening test (participant count, rating scales, and statistical analysis) would strengthen the validation claims.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their thoughtful review and for identifying a key methodological assumption in V2M-Zero. We address the concern point-by-point below and will incorporate additional analysis in the revision.

Referee: The claim that event curves from pretrained video and music encoders provide comparable representations for direct substitution rests on the unverified assumption of shared temporal structure, with no reported normalization, moment-matching, distributional comparison, or dynamic-range analysis between modalities. This is load-bearing for the zero-pair training strategy and the reported temporal gains (21-52%); if domain shift is unaccounted for, the improvements could instead reflect model robustness.

Authors: We agree that explicit verification strengthens the central claim. Event curves are computed identically within each modality: given a sequence of embeddings from a pretrained encoder, we form a temporal similarity matrix and derive a normalized change curve (scaled to [0,1] by the maximum intra-sequence difference). This formulation ensures both curves measure the same quantity, the magnitude and timing of change, independent of semantic content. While the original submission did not include side-by-side distributional plots, the 21-52% temporal synchronization gains over paired baselines across three datasets would be unlikely if domain shift were dominant; such shift would typically harm rather than enhance alignment. In the revised manuscript we will add a dedicated subsection with (i) normalization details, (ii) histogram comparisons of video vs. music curve values, and (iii) dynamic-range statistics, confirming the curves occupy statistically similar ranges.

revision: yes
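
The histogram and dynamic-range comparisons promised in this response could be prototyped along the following lines. This is a sketch only: the bin count, the histogram-intersection score, and the toy curves are assumptions chosen for illustration, and the authors may use different statistics.

```python
def histogram(values, bins=5):
    """Normalized histogram of values assumed to lie in [0, 1]."""
    counts = [0] * bins
    for v in values:
        index = min(bins - 1, max(0, int(v * bins)))  # 1.0 goes in the last bin
        counts[index] += 1
    total = len(values)
    return [c / total for c in counts]


def histogram_intersection(p, q):
    """Overlap of two normalized histograms: 1.0 identical, 0.0 disjoint."""
    return sum(min(a, b) for a, b in zip(p, q))


def dynamic_range(values):
    """Return (minimum, maximum, spread) of a curve."""
    low, high = min(values), max(values)
    return low, high, high - low


# Toy curves: same event timing, but values occupy different parts of [0, 1].
music_curve = [0.02, 0.21, 0.12, 0.93, 0.11, 0.03, 0.84, 0.13]
video_curve = [0.41, 0.52, 0.46, 0.71, 0.44, 0.42, 0.66, 0.47]

music_hist = histogram(music_curve)
video_hist = histogram(video_curve)
overlap = histogram_intersection(music_hist, video_hist)
music_range = dynamic_range(music_curve)
video_range = dynamic_range(video_curve)

print(f"overlap: {overlap:.2f}")
print(f"music spread: {music_range[2]:.2f}, video spread: {video_range[2]:.2f}")
```

In this toy case the two curves peak at the same time steps yet their value distributions barely overlap, which is the kind of mismatch the referee's comment asks the authors to rule out on real data.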

Circularity Check

0 steps flagged

No significant circularity; derivation is self-contained via independent intra-modal computations

The paper computes event curves separately within each modality from pretrained encoders, fine-tunes the text-to-music model exclusively on music-derived curves, and substitutes video curves at inference under the explicit assumption of shared temporal structure. No equation reduces the alignment metric or generated output to a fitted parameter defined from the same data or to a self-citation chain. The central zero-pair claim rests on empirical validation across separate benchmarks rather than on any definitional equivalence or imported uniqueness theorem. This is the expected non-finding for a method whose core steps remain externally verifiable.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on one domain assumption about cross-modal comparability of event curves and introduces one invented representation; no free parameters are explicitly fitted in the abstract description.

axioms (1)

domain assumption: Event curves derived from intra-modal similarity in pretrained video and music encoders capture comparable temporal structure across modalities.
Invoked to justify substituting video-event curves for music-event curves at inference without paired training.

invented entities (1)

event curves (no independent evidence)
purpose: To represent timing of changes independently within each modality for alignment.
New representation introduced to enable zero-pair substitution.

Reference graph

Works this paper leans on

111 extracted references · 111 canonical work pages

[1]

MusicLM: Generating music from text.arXiv Preprint, 2023

Andrea Agostinelli, Timo I Denk, Zalán Borsos, Jesse Engel, Mauro Verzetti, An- toine Caillon, Qingqing Huang, Aren Jansen, Adam Roberts, Marco Tagliasacchi, et al. MusicLM: Generating music from text.arXiv Preprint, 2023. 1, 4

work page 2023
[2]

V-JEPA 2: Self-supervised video models enable understanding, prediction and planning.arXiv Preprint, 2025

Mido Assran, Adrien Bardes, David Fan, Quentin Garrido, Russell Howes, Matthew Muckley, Ammar Rizvi, Claire Roberts, Koustuv Sinha, Artem Zholus, et al. V-JEPA 2: Self-supervised video models enable understanding, prediction and planning.arXiv Preprint, 2025. 8, 13

work page 2025
[3]

Yatong Bai, Jonah Casebeer, Somayeh Sojoudi, and Nicholas J. Bryan. DRAGON: Distributional rewards optimize diffusion generative models.TMLR,

work page
[4]

AudioLM: a language modeling approach to audio generation

Zalán Borsos, Raphaël Marinier, Damien Vincent, Eugene Kharitonov, Olivier Pietquin, Matt Sharifi, Dominik Roblek, Olivier Teboul, David Grangier, Marco Tagliasacchi, et al. AudioLM: a language modeling approach to audio generation. TASLP, 2023. 4

work page 2023
[5]

Re-bottleneck: Latent re-structuring for neural audio autoencoders

Dimitrios Bralios, Jonah Casebeer, and Paris Smaragdis. Re-bottleneck: Latent re-structuring for neural audio autoencoders. InMLSP, 2025. 8

work page 2025
[6]

Learning to upsample and upmix audio in the latent domain.arXiv Preprint, 2025

Dimitrios Bralios, Paris Smaragdis, and Jonah Casebeer. Learning to upsample and upmix audio in the latent domain.arXiv Preprint, 2025. 8

work page 2025
[7]

A generative-first neural audio autoencoder.arXiv:2602.15749, 2026

Jonah Casebeer, Ge Zhu, Zhepei Wang, and Nicholas J Bryan. A generative-first neural audio autoencoder.arXiv:2602.15749, 2026. 5, 13

work page arXiv 2026
[8]

Visually indicated sound generation by perceptually optimized classifi- cation

Kan Chen, Chuanxi Zhang, Chen Fang, Zhaowen Wang, Trung Bui, and Ram Nevatia. Visually indicated sound generation by perceptually optimized classifi- cation. InECCVW, 2018. 4

work page 2018
[9]

Generating visually aligned sound from videos.TIP, 2020

Peihao Chen, Yang Zhang, Mingkui Tan, Hongdong Xiao, Deng Huang, and Chuang Gan. Generating visually aligned sound from videos.TIP, 2020. 4

work page 2020
[10]

Images that sound: Composing images and sounds on a single canvas

Ziyang Chen, Daniel Geng, and Andrew Owens. Images that sound: Composing images and sounds on a single canvas. InNeurIPS, 2024. 4

work page 2024
[11]

Video-guided foley sound generation with multimodal controls

Ziyang Chen, Prem Seetharaman, Bryan Russell, Oriol Nieto, David Bourgin, Andrew Owens, and Justin Salamon. Video-guided foley sound generation with multimodal controls. InCVPR, 2025. 4

work page 2025
[12]

Mmaudio: Taming multimodal joint training for high-quality video-to-audio synthesis

Ho Kei Cheng, Masato Ishii, Akio Hayakawa, Takashi Shibuya, Alexander Schwing, and Yuki Mitsufuji. Mmaudio: Taming multimodal joint training for high-quality video-to-audio synthesis. InCVPR, 2025. 4

work page 2025
[13]

MeLFusion: Synthesizing music from image and language cues using diffusion models

Sanjoy Chowdhury, Sayan Nag, KJ Joseph, Balaji Vasan Srinivasan, and Dinesh Manocha. MeLFusion: Synthesizing music from image and language cues using diffusion models. InCVPR, 2024. 2, 4

work page 2024
[14]

Cocola: Coherence-oriented con- trastive learning of musical audio representations.arXiv Preprint, 2024

Ruben Ciranni, Giorgio Mariani, Michele Mancusi, Emilian Postolache, Giorgio Fabbro, Emanuele Rodolà, and Luca Cosmo. Cocola: Coherence-oriented con- trastive learning of musical audio representations.arXiv Preprint, 2024. 12

work page 2024
[15]

Simple and controllable music generation

Jade Copet, Felix Kreuk, Itai Gat, Tal Remez, David Kant, Gabriel Synnaeve, Yossi Adi, and Alexandre Défossez. Simple and controllable music generation. In NeurIPS, 2023. 1, 4

work page 2023
[16]

High fidelity neural audio compression.arXiv Preprint, 2022

Alexandre Défossez, Jade Copet, Gabriel Synnaeve, and Yossi Adi. High fidelity neural audio compression.arXiv Preprint, 2022. 4

work page 2022
[17]

Video background music generation with controllable music transformer

Shangzhe Di, Zeren Jiang, Si Liu, Zhaokai Wang, Leyan Zhu, Zexin He, Hongming Liu, and Shuicheng Yan. Video background music generation with controllable music transformer. InACM MM, 2021. 4, 12

work page 2021
[18]

Conditional generation of audio from video via foley analogies

Yuexi Du, Ziyang Chen, Justin Salamon, Bryan Russell, and Andrew Owens. Conditional generation of audio from video via foley analogies. InCVPR, 2023. 4 22 Lin et al

work page 2023
[19]

Fast timing- conditioned latent audio diffusion.arXiv Preprint, 2024

Zach Evans, CJ Carr, Josiah Taylor, Scott H Hawley, and Jordi Pons. Fast timing- conditioned latent audio diffusion.arXiv Preprint, 2024. 4

work page 2024
[20]

Stable audio open

Zach Evans, Julian D Parker, CJ Carr, Zack Zukowski, Josiah Taylor, and Jordi Pons. Stable audio open. InICASSP, 2025. 4, 6, 8

work page 2025
[21]

Flux that plays music.arXiv Preprint, 2024

Zhengcong Fei, Mingyuan Fan, Changqian Yu, and Junshi Huang. Flux that plays music.arXiv Preprint, 2024. 1, 4, 6

work page 2024
[22]

Visualizing musical structure and rhythm via self-similarity

Jonathan Foote and Matthew Cooper. Visualizing musical structure and rhythm via self-similarity. InICMC, 2001. 6

work page 2001
[23]

Text-to-audio generation using instruction guided latent diffusion model

Deepanway Ghosal, Navonil Majumder, Ambuj Mehrish, and Soujanya Poria. Text-to-audio generation using instruction guided latent diffusion model. InACM MM, 2023. 4, 6

work page 2023
[24]

ACE-Step: A step towards music generation foundation model.arXiv Preprint, 2025

Junmin Gong, Sean Zhao, Sen Wang, Shengyuan Xu, and Joe Guo. ACE-Step: A step towards music generation foundation model.arXiv Preprint, 2025. 1, 4

work page 2025
[25]

The llama 3 herd of models.arXiv Preprint, 2024

Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Ab- hishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. The llama 3 herd of models.arXiv Preprint, 2024. 14

work page 2024
[26]

it’s more of a vibe i’m going for

Noor Hammad, C Ailie Fraser, Erik Harpstead, Jessica Hammer, and Mira Dontcheva. “it’s more of a vibe i’m going for”: Designing text-to-music gener- ation interfaces for video creators. InDIS, 2025. 2, 3, 5, 8, 14, 16

work page 2025
[27]

Cnn architectures for large-scale audio classification

Shawn Hershey, Sourish Chaudhuri, Daniel PW Ellis, Jort F Gemmeke, Aren Jansen, R Channing Moore, Manoj Plakal, Devin Platt, Rif A Saurous, Bryan Seybold, et al. Cnn architectures for large-scale audio classification. InICASSP,

work page
[28]

Classifier-freediffusionguidance.arXiv Preprint,

JonathanHoandTimSalimans. Classifier-freediffusionguidance.arXiv Preprint,

work page
[29]

A simple sequentially rejective multiple test procedure.Scandinavian journal of statistics, 1979

Sture Holm. A simple sequentially rejective multiple test procedure.Scandinavian journal of statistics, 1979. 11

work page 1979
[30]

Make-an-audio: Text-to- audio generation with prompt-enhanced diffusion models

Rongjie Huang, Jiawei Huang, Dongchao Yang, Yi Ren, Luping Liu, Mingze Li, Zhenhui Ye, Jinglin Liu, Xiang Yin, and Zhou Zhao. Make-an-audio: Text-to- audio generation with prompt-enhanced diffusion models. InICML, 2023. 4

work page 2023
[31]

MusiScene: Leveraging mu-llama forscene imaginationand enhancedvideo background music generation

Fathinah Izzati, Xinyue Li, Yuxuan Wu, and Gus Xia. MusiScene: Leveraging mu-llama forscene imaginationand enhancedvideo background music generation. arXiv Preprint, 2025. 4

work page 2025
[32]

Video2music: Suitable music generation from videos using an affective multimodal transformer model

Jaeyong Kang, Soujanya Poria, and Dorien Herremans. Video2music: Suitable music generation from videos using an affective multimodal transformer model. arXiv Preprint, 2023. 4

work page 2023
[33]

Cotracker: It is better to track together

Nikita Karaev, Ignacio Rocco, Benjamin Graham, Natalia Neverova, Andrea Vedaldi, and Christian Rupprecht. Cotracker: It is better to track together. In ECCV, 2024. 8, 12, 14

work page 2024
[34]

Fr\’echet audio distance: A metric for evaluating music enhancement algorithms

Kevin Kilgour, Mauricio Zuluaga, Dominik Roblek, and Matthew Sharifi. Fr\’echet audio distance: A metric for evaluating music enhancement algorithms. arXiv Preprint, 2018. 9

work page 2018
[35]

Video-guided text-to-music generation using public domain movie collections

Haven Kim, Zachary Novack, Weihan Xu, Julian McAuley, and Hao-Wen Dong. Video-guided text-to-music generation using public domain movie collections. In ISMIR, 2025. 2, 4, 8, 10

work page 2025
[36]

Audiogen: Textually guided audio generation

Felix Kreuk, Gabriel Synnaeve, Adam Polyak, Uriel Singer, Alexandre Défossez, Jade Copet, Devi Parikh, Yaniv Taigman, and Yossi Adi. Audiogen: Textually guided audio generation. InICLR, 2023. 4

work page 2023
[37]

High-fidelity audio compression with improved RVQGAN

Rithesh Kumar, Prem Seetharaman, Alejandro Luebs, Ishaan Kumar, and Kun- dan Kumar. High-fidelity audio compression with improved RVQGAN. In NeurIPS, 2023. 4 V2M-Zero23

work page 2023
[38]

VinTAGe: Joint video and text conditioning for holistic audio generation

Saksham Singh Kushwaha and Yapeng Tian. VinTAGe: Joint video and text conditioning for holistic audio generation. InCVPR, 2025. 4

work page 2025
[39]

Learning self- similarity in space and time as generalized motion for video action recognition

Heeseung Kwon, Manjin Kim, Suha Kwak, and Minsu Cho. Learning self- similarity in space and time as generalized motion for video action recognition. InICCV, 2021. 6

work page 2021
[40]

Efficient neural music generation

Max WY Lam, Qiao Tian, Tang Li, Zongyu Yin, Siyuan Feng, Ming Tu, Yuliang Ji, Rui Xia, Mingbo Ma, Xuchen Song, et al. Efficient neural music generation. InNeurIPS, 2023. 1, 4

work page 2023
[41]

Dancing to music

Hsin-Ying Lee, Xiaodong Yang, Ming-Yu Liu, Ting-Chun Wang, Yu-Ding Lu, Ming-Hsuan Yang, and Jan Kautz. Dancing to music. InNeurIPS, 2019. 4

work page 2019
[42]

Mozart’s touch: a lightweight multimodal music generation framework based on pre-trained large models

Jiajun Li, Tianze Xu, Xuesong Chen, Xinrui Yao, Jingchou Han, and Shuchang Liu. Mozart’s touch: a lightweight multimodal music generation framework based on pre-trained large models. InAIGC, 2025. 4

work page 2025
[43]

AI choreographer: Music conditioned 3d dance generation with AIST++

Ruilong Li, Shan Yang, David A Ross, and Angjoo Kanazawa. AI choreographer: Music conditioned 3d dance generation with AIST++. InICCV, 2021. 9, 11, 12

work page 2021
[44]

MuVi: Video-to-music generation with semantic alignment and rhythmic synchro- nization.arXiv Preprint, 2024

Ruiqi Li, Siqi Zheng, Xize Cheng, Ziang Zhang, Shengpeng Ji, and Zhou Zhao. MuVi: Video-to-music generation with semantic alignment and rhythmic synchro- nization.arXiv Preprint, 2024. 2, 4

work page 2024
[45]

Dance-to-music generation with encoder- based textual inversion

Sifei Li, Weiming Dong, Yuxin Zhang, Fan Tang, Chongyang Ma, Oliver Deussen, Tong-Yee Lee, and Changsheng Xu. Dance-to-music generation with encoder- based textual inversion. InSIGGRAPH Asia, 2024. 4, 9, 12, 16

work page 2024
[46]

Diff-BGM: A diffusion model for video background music generation

Sizhe Li, Yiming Qin, Minghang Zheng, Xin Jin, and Yang Liu. Diff-BGM: A diffusion model for video background music generation. InCVPR, 2024. 4

work page 2024
[47]

VidMusician: Video-to-music generation with semantic-rhythmic alignment via hierarchical visual features.arXiv Preprint, 2024

Sifei Li, Binxin Yang, Chunji Yin, Chong Sun, Yuxin Zhang, Weiming Dong, and Chen Li. VidMusician: Video-to-music generation with semantic-rhythmic alignment via hierarchical visual features.arXiv Preprint, 2024. 2, 4

work page 2024
[48]

Siamese vision transformers are scalable audio- visual learners

Yan-Bo Lin and Gedas Bertasius. Siamese vision transformers are scalable audio- visual learners. InECCV, 2024. 8, 13

work page 2024
[49]

VMAS: Video-to-music generationviasemanticalignment inweb musicvideos

Yan-Bo Lin, Yu Tian, Linjie Yang, Gedas Bertasius, and Heng Wang. VMAS: Video-to-music generationviasemanticalignment inweb musicvideos. InW ACV,

work page
[50]

AudioLDM: Text-to-audio generation with latent diffusion models

Haohe Liu, Zehua Chen, Yi Yuan, Xinhao Mei, Xubo Liu, Danilo Mandic, Wenwu Wang, and Mark D Plumbley. AudioLDM: Text-to-audio generation with latent diffusion models. InICML, 2023. 4, 6

work page 2023
[51]

AudioLDM 2: Learn- ing holistic audio generation with self-supervised pretraining.arXiv Preprint,

Haohe Liu, Qiao Tian, Yi Yuan, Xubo Liu, Xinhao Mei, Qiuqiang Kong, Yuping Wang, Wenwu Wang, Yuxuan Wang, and Mark D Plumbley. AudioLDM 2: Learn- ing holistic audio generation with self-supervised pretraining.arXiv Preprint,

work page
[52]

ThinkSound: Chain-of-thought reasoning in multimodal large language models for audio generation and editing

Huadai Liu, Jialei Wang, Kaicheng Luo, Wen Wang, Qian Chen, Zhou Zhao, and Wei Xue. ThinkSound: Chain-of-thought reasoning in multimodal large language models for audio generation and editing. InNeurIPS, 2025. 4

work page 2025
[53]

M2 UGen: Multi-modal music understanding and generation with the power of large language models.arXiv Preprint, 2023

Shansong Liu, Atin Sakkeer Hussain, Qilong Wu, Chenshuo Sun, and Ying Shan. M2 UGen: Multi-modal music understanding and generation with the power of large language models.arXiv Preprint, 2023. 4

work page 2023
[54]

MuMu-LLaMA: Multi-modal music understanding and generation via large lan- guage models.arXiv Preprint, 2024

Shansong Liu, Atin Sakkeer Hussain, Qilong Wu, Chenshuo Sun, and Ying Shan. MuMu-LLaMA: Multi-modal music understanding and generation via large lan- guage models.arXiv Preprint, 2024. 9, 10, 18

work page 2024
[55]

Tell what you hear from what you see-video to audio generation through text

Xiulong Liu, Kun Su, and Eli Shlizerman. Tell what you hear from what you see-video to audio generation through text. InNeurIPS, 2024. 4

work page 2024
[56]

Extending visual dy- namics for video-to-music generation.arXiv Preprint, 2025

Xiaohao Liu, Teng Tu, Yunshan Ma, and Tat-Seng Chua. Extending visual dy- namics for video-to-music generation.arXiv Preprint, 2025. 2, 4 24 Lin et al

work page 2025
[57]

SongGen: A single stage auto- regressive transformer for text-to-song generation

Zihan Liu, Shuangrui Ding, Zhixiong Zhang, Xiaoyi Dong, Pan Zhang, Yuhang Zang, Yuhang Cao, Dahua Lin, and Jiaqi Wang. SongGen: A single stage auto- regressive transformer for text-to-song generation. InICML, 2025. 1, 4

work page 2025
[58]

Diff-Foley: Synchro- nized video-to-audio synthesis with latent diffusion models

Simian Luo, Chuanhao Yan, Chenxu Hu, and Hang Zhao. Diff-Foley: Synchro- nized video-to-audio synthesis with latent diffusion models. InNeurIPS, 2023. 4

work page 2023
[59]

The song describer dataset: a corpus of audio captions for music-and-language evaluation.arXiv Preprint, 2023

Ilaria Manco, Benno Weck, Seungheon Doh, Minz Won, Yixiao Zhang, Dmitry Bogdanov, Yusong Wu, Ke Chen, Philip Tovstogan, Emmanouil Benetos, et al. The song describer dataset: a corpus of audio captions for music-and-language evaluation.arXiv Preprint, 2023. 9, 10

work page 2023
[60]

FoleyGen: Visually-guided audio generation.arXiv Preprint, 2023

Xinhao Mei, Varun Nagaraja, Gael Le Lan, Zhaoheng Ni, Ernie Chang, Yangyang Shi, and Vikas Chandra. FoleyGen: Visually-guided audio generation.arXiv Preprint, 2023. 4

work page 2023
[61]

Mustango: Toward controllable text-to-music generation.arXiv Preprint, 2023

Jan Melechovsky, Zixun Guo, Deepanway Ghosal, Navonil Majumder, Dorien Herremans, and Soujanya Poria. Mustango: Toward controllable text-to-music generation.arXiv Preprint, 2023. 4, 6

work page 2023
[62]

Fast text-to-audio generation with adversarial post-training

Zachary Novack, Zach Evans, Zack Zukowski, Josiah Taylor, CJ Carr, Julian Parker, Adnan Al-Sinan, Gian Marco Iodice, Julian McAuley, Taylor Berg- Kirkpatrick, et al. Fast text-to-audio generation with adversarial post-training. arXiv Preprint, 2025. 4

[63] Zachary Novack, Julian McAuley, Taylor Berg-Kirkpatrick, and Nicholas J. Bryan. DITTO: Diffusion inference-time t-optimization for music generation. In ICML,

[64] Zachary Novack, Ge Zhu, Jonah Casebeer, Julian McAuley, Taylor Berg-Kirkpatrick, and Nicholas J. Bryan. Presto! distilling steps and layers for accelerating music generation. In ICLR, 2025. 8

[65] Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. DINOv2: Learning robust visual features without supervision. arXiv Preprint, 2023. 7, 8, 10, 13, 18

[66] Andrew Owens, Phillip Isola, Josh McDermott, Antonio Torralba, Edward H Adelson, and William T Freeman. Visually indicated sounds. In CVPR, 2016. 4

[67] William Peebles and Saining Xie. Scalable diffusion models with transformers. In ICCV, pages 4195–4205, 2023. 6

[68] Geoffroy Peeters. Self-similarity-based and novelty-based loss for music structure analysis. In International Society of Music Information Retrieval, 2023. 6

[69] Adam Polyak, Amit Zohar, Andrew Brown, Andros Tjandra, Animesh Sinha, Ann Lee, Apoorv Vyas, Bowen Shi, Chih-Yao Ma, Ching-Yao Chuang, et al. Movie Gen: A cast of media foundation models. arXiv Preprint, 2024. 9, 10

[70] Fan Qi, Kunsheng Ma, and Changsheng Xu. Customized condition controllable generation for video soundtrack. In CVPR, 2025. 4

[71] Alec Radford, Jong Wook Kim, Tao Xu, Greg Brockman, Christine McLeavey, and Ilya Sutskever. Robust speech recognition via large-scale weak supervision. In ICML, 2023. 16

[72] Ciara Rowles, Varun Jampani, Simon Donné, Shimon Vainer, Julian Parker, and Zach Evans. Foley control: Aligning a frozen latent text-to-audio model to video. arXiv Preprint, 2025. 4

[73] Flavio Schneider, Ojasv Kamal, Zhijing Jin, and Bernhard Schölkopf. Moûsai: Text-to-music generation with long-context latent diffusion. arXiv Preprint, 2023. 1, 4

[74] Sizhe Shan, Qiulin Li, Yutao Cui, Miles Yang, Yuehai Wang, Qun Yang, Jin Zhou, and Zhao Zhong. Hunyuanvideo-foley: Multimodal diffusion with representation alignment for high-fidelity foley audio generation. arXiv Preprint, 2025. 17

[75] Megha Sharma, Muhammad Taimoor Haseeb, Gus Xia, and Yoshimasa Tsuruoka. M2M-Gen: A multimodal framework for automated background music generation in Japanese manga using large language models. arXiv Preprint, 2024. 4

[76] Eli Shechtman and Michal Irani. Matching local self-similarities across images and videos. In CVPR, 2007. 6

[77] Eli Shlizerman, Lucio Dery, Hayden Schoen, and Ira Kemelmacher-Shlizerman. Audio to body dynamics. In CVPR, 2018. 4

[78] Kun Su, Judith Yue Li, Qingqing Huang, Dima Kuzmin, Joonseok Lee, Chris Donahue, Fei Sha, Aren Jansen, Yu Wang, Mauro Verzetti, et al. V2Meow: Meowing to the visual beat via music generation. In AAAI, 2024. 4

[79] Kun Su, Xiulong Liu, and Eli Shlizerman. From vision to audio and beyond: A unified model for audio-visual representation and generation. In ICML, 2024. 4

[80] Changchang Sun, Gaowen Liu, Charles Fleming, and Yan Yan. Enhancing dance-to-music generation via negative conditioning latent diffusion model. In CVPR,
