
arxiv: 2603.11042 · v2 · pith:UMFKTNZTnew · submitted 2026-03-11 · 💻 cs.CV · cs.AI · cs.LG · cs.MM · cs.SD

V2M-Zero: Zero-Pair Time-Aligned Video-to-Music Generation

Pith reviewed 2026-05-15 13:15 UTC · model grok-4.3

classification 💻 cs.CV · cs.AI · cs.LG · cs.MM · cs.SD
keywords video-to-music generation · zero-pair learning · temporal alignment · event curves · intra-modal similarity · disentangled control · text-to-music fine-tuning

The pith

Event curves from intra-modal similarities enable zero-pair training for time-aligned video-to-music generation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that temporal alignment between video and generated music can be achieved by computing event curves separately within each modality from pretrained encoders, then substituting video curves into a music model at inference. This avoids any requirement for paired video-music training data while disentangling timing control from semantic factors such as genre or mood. A reader would care because paired cross-modal datasets are costly and scarce, and the reported results show higher performance than paired baselines across objective metrics and human listening tests. The core insight is that synchronization depends on matching the timing and magnitude of changes rather than their semantic content.

Core claim

We introduce V2M-ZERO, a video-to-music generation approach that generates time-aligned music with disentangled time synchronization and semantic control from video while requiring zero video-music pairs at training time. Temporal synchronization requires matching when and how much change occurs, not what changes. Shared temporal structure is captured independently within each modality through event curves computed from intra-modal similarity using pretrained music and video encoders. These curves provide comparable representations across modalities, enabling a training strategy of fine-tuning a text-to-music model on music-event curves and substituting video-event curves at inference.

What carries the argument

Event curves computed from intra-modal similarity using pretrained encoders, which capture the timing and magnitude of changes independently per modality to allow direct substitution for alignment.
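
As a rough illustration of what such a curve might look like in code, the sketch below derives a change curve from a sequence of per-step embeddings by taking one minus the cosine similarity between adjacent steps and scaling by the peak. This is a minimal sketch under stated assumptions, not the paper's implementation: the helper names `cosine` and `event_curve`, the adjacent-step comparison, and the peak normalization are illustrative choices, and the toy embeddings stand in for the outputs of a pretrained encoder.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two vectors; returns 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a > 0 and norm_b > 0 else 0.0


def event_curve(embeddings: Sequence[Sequence[float]]) -> List[float]:
    """Hypothetical event curve: per-step change, scaled so the peak is 1.

    Assumption: change at step t is 1 - cosine(e[t-1], e[t]). The first step
    has no predecessor, so its change is set to 0.
    """
    if not embeddings:
        return []
    change = [0.0] + [
        1.0 - cosine(embeddings[t - 1], embeddings[t])
        for t in range(1, len(embeddings))
    ]
    peak = max(change)
    # A sequence with no change at all stays flat at zero.
    return [c / peak for c in change] if peak > 0 else change


# Toy stand-in for encoder output: one abrupt change at step 5.
toy_embeddings = [[1.0, 0.0]] * 5 + [[0.0, 1.0]] * 5
curve = event_curve(toy_embeddings)
print(curve)
```

The same function would be applied unchanged to music embeddings and to video-frame embeddings, which is the sense in which the two curves are meant to be comparable: each records when change happens and how large it is, not what changed.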

If this is right

  • Surpasses prior baselines with 5-9% higher audio quality, 13-15% better semantic alignment, and 21-52% improved temporal synchronization without any paired data.
  • Achieves 28% higher beat alignment on dance videos from the AIST++ benchmark.
  • Delivers comparable gains in large-scale crowd-sourced subjective listening tests.
  • Enables separate controls for timing via event curves and for style via text prompts such as genre or mood.
  • Validates that within-modality temporal features outperform paired cross-modal supervision for this task.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The intra-modal curve approach could transfer to other cross-modal tasks where paired data is limited, such as aligning generated audio to text descriptions or images.
  • Treating temporal structure as modality-independent may lower data requirements for training multimodal generators in domains like film scoring or dance music synthesis.
  • If the substitution works reliably, it opens a path to iterative refinement where timing curves are edited independently of semantic prompts.

Load-bearing premise

Event curves from intra-modal similarities using pretrained encoders provide comparable representations across video and music modalities that can be directly substituted at inference without cross-modal training.
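
One way to probe this premise is the kind of check the Figure 2 caption describes: correlate a video curve with its ground-truth music curve, then repeat after shifting one of them in time. The sketch below shows the mechanics on toy data. The function names `pearson` and `circular_shift`, the circular form of the offset, and the example curves are assumptions made for illustration; the correlation values reported in the caption (about 0.6 aligned, about 0.2 with random offsets) come from the paper's real data, not from this code.

```python
import math
from typing import List, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation of two equal-length curves (0.0 if either is flat)."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("curves must have equal length of at least 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def circular_shift(curve: Sequence[float], offset: int) -> List[float]:
    """Shift a curve to the right by `offset` steps, wrapping around."""
    if not curve:
        return []
    k = offset % len(curve)
    return list(curve[-k:]) + list(curve[:-k]) if k else list(curve)


# Toy curves: events at the same steps, with different magnitudes.
music_curve = [0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0]
video_curve = [0.0, 0.0, 0.8, 0.1, 0.0, 0.0, 0.9, 0.0, 0.0, 0.0]

aligned = pearson(music_curve, video_curve)
misaligned = pearson(music_curve, circular_shift(video_curve, 2))
print(f"aligned={aligned:.3f} misaligned={misaligned:.3f}")
```

If the premise holds, correlation on real pairs should drop noticeably once an offset is introduced, as it does in this toy case.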

What would settle it

On benchmarks such as OES-Pub or MovieGenBench-Music, if models using substituted video event curves show no improvement or worse temporal synchronization and beat alignment metrics than the strongest paired cross-modal baselines, the claim is falsified.

Figures

Figures reproduced from arXiv: 2603.11042 by Aniruddha Mahapatra, Gedas Bertasius, Jonah Casebeer, Long Mai, Nicholas J. Bryan, Yan-Bo Lin.

Figure 1
Figure 1: Zero-Pair Video-to-Music Generation. Top: Generating music for video commonly requires large-scale collections of high-quality, paired video-music data. Middle: Our V2M-Zero method is trained only on text–music pairs with an additional music-event curve condition (no video). Bottom: At inference, we swap a music-event curve with aligned video-event curves extracted via off-the-shelf vision models and gener…
Figure 2
Figure 2: Shared Temporal Structure Across Modalities. Real event curves computed from video and music exhibit similar temporal patterns across diverse video scenarios. Ground-truth pairs have correlation ≈ 0.6; introducing random offsets degrades this to ≈ 0.2. In practice, video and music synchronization often corresponds to (sparse) moments of interest or events over time (e.g., video events of dancing and s…
Figure 3
Figure 3: Method Overview. Top: During training, V2M-Zero learns a rectified-flow diffusion process conditioned on text prompts and a music-event curve derived from intra-music similarity. Bottom: At inference, music conditioning is swapped with a video-event curve based on framewise similarity, enabling zero-pair, time-synchronized video-to-music generation. For semantic alignment, a text prompt is predicted from th…
Figure 4
Figure 4: Impact of Smoothing Kernel Size. Larger kernels improve audio quality (FAD*) but temporal alignment (SCH) has an optimal point on OES-Pub. We systematically ablate four design axes: (i) kernel size for modality gap mitigation, (ii) encoders for event-curve extraction, (iii) domain-specific visual encoders, and (iv) LLM selection for prompt generation. Mitigating Modality Gap. Music-event curves (training…
Figure 5
Figure 5: Example event curves with different temporal dynamics.
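
The Figure 4 caption refers to a smoothing kernel whose size trades audio quality against temporal alignment. The excerpt does not say which kernel the paper uses, so the sketch below assumes a plain moving average purely to show how kernel size changes a curve; the function name `smooth`, the odd-size restriction, and the edge handling (averaging over whatever samples fall inside the window) are all illustrative assumptions.

```python
from typing import List, Sequence


def smooth(curve: Sequence[float], kernel_size: int) -> List[float]:
    """Moving-average smoothing with a centered window of `kernel_size` steps.

    Near the edges the window is clipped, so the average is taken over the
    samples that actually exist.
    """
    if kernel_size < 1 or kernel_size % 2 == 0:
        raise ValueError("kernel_size must be a positive odd integer")
    half = kernel_size // 2
    smoothed = []
    for t in range(len(curve)):
        lo = max(0, t - half)
        hi = min(len(curve), t + half + 1)
        window = curve[lo:hi]
        smoothed.append(sum(window) / len(window))
    return smoothed


# A single sharp event at step 5.
impulse = [0.0] * 5 + [1.0] + [0.0] * 4
narrow = smooth(impulse, 3)
wide = smooth(impulse, 7)
print(max(narrow), max(wide))
```

A wider kernel spreads each event over more steps and lowers its peak, which gives some intuition for why timing precision could suffer past a certain size even if other qualities improve.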
Original abstract

Generating music that temporally aligns with video events is challenging for existing text-to-music models, which lack fine-grained temporal control. We introduce V2M-ZERO, a video-to-music generation approach that generates time-aligned music with disentangled time synchronization and semantic control (e.g., genre, mood) from video while requiring zero video-music pairs at training time. Our method is motivated by a key observation: temporal synchronization requires matching when and how much change occurs, not what changes. While musical and visual events differ semantically, they exhibit shared temporal structure that can be captured independently within each modality. We capture this structure through event curves computed from intra-modal similarity using pretrained music and video encoders. By measuring temporal change within each modality independently, these curves provide comparable representations across modalities. This enables a simple training strategy: fine-tune a text-to-music model on music-event curves, then substitute video-event curves at inference without cross-modal training or paired data. Across OES-Pub, MovieGenBench-Music, and AIST++, V2M-ZERO achieves state-of-the-art performance without any paired music-video data, surpassing the strongest prior baselines per metric with 5-9% higher audio quality, 13-15% better semantic alignment, 21-52% improved temporal synchronization, and 28% higher beat alignment on dance videos. We find similar results via a large crowd-source subjective listening test. Our results validate that temporal alignment through within-modality features is not only effective for video-to-music generation but also leads to better performance than paired cross-modal supervision. Furthermore, our approach enables independent controls for timing and music style (e.g., genre, mood) for more controllable generation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper presents V2M-Zero, a zero-pair method for video-to-music generation that achieves temporal alignment by computing event curves independently from intra-modal similarities using pretrained video and music encoders. It fine-tunes a text-to-music model on music event curves only, then substitutes video event curves at inference to control timing while allowing separate semantic control (e.g., genre, mood). The approach reports state-of-the-art results across OES-Pub, MovieGenBench-Music, and AIST++ with gains of 5-9% in audio quality, 13-15% in semantic alignment, 21-52% in temporal synchronization, and 28% in beat alignment, plus subjective validation, claiming superiority over paired cross-modal supervision.

Significance. If the central claim holds, the work would be significant for showing that within-modality temporal structures can substitute for paired data in cross-modal alignment tasks, enabling more controllable generation and reducing dependence on expensive paired datasets. This has potential implications for zero-shot multimodal synthesis in computer vision and audio, provided the event-curve equivalence is validated.

major comments (1)
  1. [Method (event curve substitution)] The claim that event curves from pretrained video and music encoders provide comparable representations for direct substitution rests on an unverified assumption of shared temporal structure: no normalization, moment-matching, distributional comparison, or dynamic-range analysis between modalities is reported. This assumption is load-bearing for the zero-pair training strategy and the reported temporal gains (21-52%); if domain shift between the two curve types is unaccounted for, the improvements could instead reflect model robustness.
minor comments (2)
  1. [Abstract and results] Exact baselines, statistical significance tests, and controls for data leakage are not detailed for the reported metric improvements across the three benchmarks.
  2. [Subjective evaluation] Additional specifics on the crowd-sourced listening test (participant count, rating scales, and statistical analysis) would strengthen the validation claims.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their thoughtful review and for identifying a key methodological assumption in V2M-Zero. We address the concern point-by-point below and will incorporate additional analysis in the revision.

Point-by-point responses
  1. Referee: The claim that event curves from pretrained video and music encoders provide comparable representations for direct substitution rests on an unverified assumption of shared temporal structure: no normalization, moment-matching, distributional comparison, or dynamic-range analysis between modalities is reported. This assumption is load-bearing for the zero-pair training strategy and the reported temporal gains (21-52%); if domain shift between the two curve types is unaccounted for, the improvements could instead reflect model robustness.

    Authors: We agree that explicit verification strengthens the central claim. Event curves are computed identically within each modality: given a sequence of embeddings from a pretrained encoder, we form a temporal similarity matrix and derive a normalized change curve (scaled to [0,1] by the maximum intra-sequence difference). This formulation ensures both curves measure the same quantity (the magnitude and timing of change) independent of semantic content. While the original submission did not include side-by-side distributional plots, the 21-52% temporal synchronization gains over paired baselines across three datasets would be unlikely if domain shift were dominant; such shift would typically harm rather than enhance alignment. In the revised manuscript we will add a dedicated subsection with (i) normalization details, (ii) histogram comparisons of video vs. music curve values, and (iii) dynamic-range statistics, confirming the curves occupy statistically similar ranges.
    revision: yes
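
The checks named in this exchange (dynamic-range statistics and moment matching) are simple to state in code. The sketch below is a hypothetical illustration of them, not an analysis from the paper or from the rebuttal: `summarize` and `moment_match` are made-up helper names, the two toy curves are invented, and moment matching is shown only as the referee's suggested mitigation rather than something the method is said to use.

```python
import math
from typing import Dict, List, Sequence


def summarize(values: Sequence[float]) -> Dict[str, float]:
    """Basic distribution statistics for one set of curve values."""
    n = len(values)
    if n == 0:
        raise ValueError("values must be non-empty")
    mean = sum(values) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    return {
        "mean": mean,
        "std": std,
        "min": min(values),
        "max": max(values),
        "range": max(values) - min(values),
    }


def moment_match(source: Sequence[float], reference: Sequence[float]) -> List[float]:
    """Rescale `source` so its mean and std equal those of `reference`."""
    s = summarize(source)
    r = summarize(reference)
    if s["std"] == 0:
        # A flat source carries no timing information; map it to the mean.
        return [r["mean"]] * len(source)
    scale = r["std"] / s["std"]
    return [(v - s["mean"]) * scale + r["mean"] for v in source]


# Toy curves with the same event timing but different value distributions.
music_values = [0.0, 0.1, 1.0, 0.2, 0.0, 0.1, 0.9, 0.1]
video_values = [0.2, 0.3, 0.6, 0.3, 0.2, 0.3, 0.5, 0.3]

print(summarize(music_values))
print(summarize(video_values))
matched = moment_match(video_values, music_values)
print(summarize(matched))
```

Comparing the two summaries before matching is the dynamic-range question the referee raises; a large gap there would suggest the music-trained model sees out-of-distribution conditioning when video curves are substituted.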

Circularity Check

0 steps flagged

No significant circularity; derivation is self-contained via independent intra-modal computations

full rationale

The paper computes event curves separately within each modality from pretrained encoders, fine-tunes the text-to-music model exclusively on music-derived curves, and substitutes video curves at inference under the explicit assumption of shared temporal structure. No equation reduces the alignment metric or generated output to a fitted parameter defined from the same data or to a self-citation chain. The central zero-pair claim rests on empirical validation across separate benchmarks rather than on any definitional equivalence or imported uniqueness theorem. This is the expected non-finding for a method whose core steps remain externally verifiable.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on one domain assumption about cross-modal comparability of event curves and introduces one invented representation; no free parameters are explicitly fitted in the abstract description.

axioms (1)
  • domain assumption: Event curves derived from intra-modal similarity in pretrained video and music encoders capture comparable temporal structure across modalities.
    Invoked to justify substituting video-event curves for music-event curves at inference without paired training.
invented entities (1)
  • event curves (no independent evidence)
    purpose: To represent timing of changes independently within each modality for alignment.
    New representation introduced to enable zero-pair substitution.

pith-pipeline@v0.9.0 · 5642 in / 1442 out tokens · 68158 ms · 2026-05-15T13:15:18.525988+00:00 · methodology

