AudioMoG: Guiding Audio Generation with Mixture-of-Guidance

Binjie Yuan; Chang Li; Junyou Wang; Jun Zhu; Kaiwen Zheng; Yuxuan Jiang; Zehua Chen

arXiv: 2509.23727 · v2 · submitted 2025-09-28 · cs.SD · cs.AI

Junyou Wang, Zehua Chen, Binjie Yuan, Kaiwen Zheng, Chang Li, Yuxuan Jiang, Jun Zhu

Pith reviewed 2026-05-18 13:05 UTC · model grok-4.3

keywords: text-to-audio, diffusion models, classifier-free guidance, sampling methods, audio generation, video-to-audio, mixture of guidance

The pith

A mixture-of-guidance sampling method outperforms single-guidance techniques in diffusion-based audio generation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper aims to establish that blending multiple guidance signals during sampling yields higher-quality audio from diffusion models than relying on any single guidance signal. Classifier-free guidance (CFG) strengthens how closely the result matches a text or video prompt but tends to reduce the variety of the sounds produced. Autoguidance (AG) aims to preserve that variety, yet applied directly to audio generation it has so far underperformed CFG. By combining the two signals with interaction terms, such as the output of an unconditional "bad" version of the model, the new approach seeks to accumulate the strengths of each without retraining the model or slowing down sampling. If the claim holds, creators could generate more faithful and varied audio for videos, music, or other media at the same computational cost.
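
Both guidance rules contrasted above are linear extrapolations between model predictions. The following is a minimal sketch, assuming the standard formulations from Ho and Salimans (2022) and Karras et al. (2024a) rather than anything specific to this paper; the function names and toy vectors are illustrative only.

```python
def extrapolate(base, target, w):
    """Move from `base` toward (and, for w > 1, past) `target`."""
    return [b + w * (t - b) for b, t in zip(base, target)]


def cfg(d_cond, d_uncond, w):
    """Classifier-free guidance: push away from the unconditional prediction."""
    return extrapolate(d_uncond, d_cond, w)


def autoguidance(d_main, d_bad, w):
    """Autoguidance: push away from a weaker model given the same condition."""
    return extrapolate(d_bad, d_main, w)


# Toy 2-D "denoiser outputs" (made-up numbers, not real model outputs).
d_cond = [1.0, 0.0]    # main model, with the prompt
d_uncond = [0.0, 0.0]  # main model, without the prompt
d_bad = [0.5, 0.5]     # weaker model, with the prompt

cfg_out = cfg(d_cond, d_uncond, 3.0)
ag_out = autoguidance(d_cond, d_bad, 3.0)
print(cfg_out, ag_out)
```

With w = 1 both rules return the main conditional prediction unchanged; the two methods differ only in which prediction serves as the base of the extrapolation.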

Core claim

By integrating diverse guidance signals with their interaction terms in a mixture-of-guidance framework, the method delivers better generation quality than single-guidance baselines in text-to-audio tasks across sampling steps while also showing gains in video-to-audio, text-to-music, and image generation at identical inference speeds.

What carries the argument

A mixture-of-guidance framework that combines classifier-free guidance, autoguidance, and interaction terms (for example, the output of an unconditional bad version of the model) to accumulate their respective advantages during sampling.
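
One way to picture such a combination is as a weighted sum of guidance directions added to the conditional prediction. The sketch below is an assumption made for illustration, not the paper's update rule: the weights `w_cfg`, `w_ag`, and `w_int`, and the particular interaction direction built from the unconditional bad model, are all hypothetical.

```python
def mix_guidance(d_cond, d_uncond, d_bad_cond, d_bad_uncond,
                 w_cfg, w_ag, w_int=0.0):
    """Add weighted guidance directions to the conditional prediction."""
    out = []
    for c, u, bc, bu in zip(d_cond, d_uncond, d_bad_cond, d_bad_uncond):
        cfg_dir = c - u   # with condition vs. without condition
        ag_dir = c - bc   # main model vs. bad model, both conditional
        int_dir = u - bu  # hypothetical interaction: main vs. bad, unconditional
        out.append(c + w_cfg * cfg_dir + w_ag * ag_dir + w_int * int_dir)
    return out


# Made-up toy predictions, not real model outputs.
d_cond, d_uncond = [1.0, 0.0], [0.0, 0.0]
d_bad_cond, d_bad_uncond = [0.5, 0.5], [0.0, 0.5]
preds = (d_cond, d_uncond, d_bad_cond, d_bad_uncond)

cfg_only = mix_guidance(*preds, w_cfg=2.0, w_ag=0.0)  # same as CFG at scale 3
ag_only = mix_guidance(*preds, w_cfg=0.0, w_ag=2.0)   # same as AG at scale 3
both = mix_guidance(*preds, w_cfg=1.0, w_ag=1.0)
print(cfg_only, ag_only, both)
```

Setting one weight to zero recovers a single-guidance sampler, which mirrors the "degraded forms" described in the Figure 1 caption.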

If this is right

The method produces higher quality text-to-audio results than single-guidance approaches at the same inference speed across different sampling steps.
Performance advantages appear in video-to-audio generation as well.
The benefits carry over to text-to-music and image generation tasks.
No model retraining is required to obtain the improvements.
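
What "same inference speed" means depends on how network calls are counted. The following is a rough sketch of that bookkeeping, assuming each forward pass counts as one function evaluation (NFE) regardless of model size; the per-step call counts are assumptions for illustration, not figures taken from the paper.

```python
# Assumed forward passes per sampling step (hypothetical accounting).
CALLS_PER_STEP = {
    "cfg": 2,      # conditional + unconditional
    "ag": 2,       # main model + bad model
    "mixture": 4,  # main and bad model, each with and without the condition
}


def steps_within_budget(method, nfe_budget):
    """Number of whole sampling steps that fit in a fixed NFE budget."""
    return nfe_budget // CALLS_PER_STEP[method]


for method in CALLS_PER_STEP:
    print(method, steps_within_budget(method, 100))
```

Under this accounting, a sampler that needs more calls per step must take fewer steps to stay within the same budget, so a comparison at matched NFE is stricter than one at matched step count.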

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same mixing strategy could be tested on video or 3D generation models to check whether the quality-diversity balance improves in those domains too.
One could measure whether the optimal mix of guidance weights stays stable when the underlying diffusion model changes size or training data.
Future work might examine if the interaction terms reduce the amount of manual tuning needed when moving the method to new conditioning types such as audio captions.
This framework suggests a path toward adaptive sampling that selects or weights guidance signals on the fly based on intermediate generation quality.

Load-bearing premise

Combining the different guidance signals and their interaction terms will add up their benefits without creating inconsistencies that demand heavy per-model adjustments.

What would settle it

Run the mixture-of-guidance sampler on a standard text-to-audio diffusion model and compare objective quality metrics against the strongest single-guidance baseline, with both using the same number of sampling steps. Finding no improvement would count against the claim.

Figures

Figures reproduced from arXiv: 2509.23727 by Binjie Yuan, Chang Li, Junyou Wang, Jun Zhu, Kaiwen Zheng, Yuxuan Jiang, Zehua Chen.

**Figure 1.** Overall framework of our proposed AudioMoG, which illustrates the mechanism of AudioMoG and its degraded forms: Hierarchical Guidance (HG) exploits cumulative advantages from both methods for optimal performance, Parallel Guidance (PG) introduces complementary directions, and CFG or AG provides a single-directional guidance.

**Figure 2.** Illustration of guidance methods on the fractal-like 2D distribution from Karras et al. (2024a). (a) Ground truth distribution (orange class). (b) Unguided conditional sampling generates outliers. (c) Classifier-Free Guidance (w = 3) with a well-trained unconditional model struggles to remove outliers. (d) Autoguidance (w = 3) improves score estimation and removes outliers without reducing diversity. (e) P…

**Figure 3.** Performance comparison of HG and CFG-only across different NFEs. As shown, HG consistently outperforms CFG-only across all settings.

**Figure 4.** Case study comparing the spectrogram outputs of different guidance strategies (CFG, PG, HG) under various text prompts. HG consistently demonstrates superior harmonic structure modeling and clearer spectral patterns compared to PG and CFG. While PG shows moderate improvements, CFG often struggles to capture harmonics and yields blurrier, less structured results, particularly for complex prompts. These exam…

**Figure 5.** Impact of different guidance scales in PG. (a), (b), and (c) show the relations of FAD, IS, and CLAP with w1 and w2, respectively. The best configurations are denoted with stars. Recoverable panel titles: (a) FAD and IS w.r.t. w1; (b) FAD and IS w.r.t. w2.

**Figure 6.** Impact of different guidance scales in HG. (a) Sweep over w1 while keeping w2 and w3 unchanged. (b) Sweep over w2 while keeping w1 and w3 unchanged. (c) Sweep over w3 while keeping w1 and w2 unchanged. The best configurations are denoted with dashed lines.
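
The one-at-a-time sweep described in the Figure 6 caption can be sketched as follows. The `toy_score` function is a hypothetical stand-in for a real metric such as FAD (lower is better), and the default weights and grids are made-up values, not the paper's settings.

```python
def sweep_one_at_a_time(evaluate, defaults, grids):
    """Vary one weight over its grid while holding the others at defaults."""
    results = {}
    for name, values in grids.items():
        rows = []
        for value in values:
            weights = dict(defaults)  # copy so the defaults stay untouched
            weights[name] = value
            rows.append((value, evaluate(**weights)))
        results[name] = rows
    return results


def toy_score(w1, w2, w3):
    """Hypothetical metric with a known minimum; lower is better."""
    return (w1 - 3.0) ** 2 + (w2 - 2.0) ** 2 + (w3 - 1.0) ** 2


defaults = {"w1": 3.0, "w2": 2.0, "w3": 1.0}
grids = {"w1": [2.0, 3.0, 4.0], "w2": [1.0, 2.0, 3.0], "w3": [0.5, 1.0, 1.5]}

results = sweep_one_at_a_time(toy_score, defaults, grids)
best = {name: min(rows, key=lambda r: r[1])[0] for name, rows in results.items()}
print(best)
```

A sweep of this kind only explores axis-aligned slices through the default point, so it can miss better settings that require changing two weights together.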

**Figure 7.** Screenshot of our subjective evaluations.

**Figure 8.** More T2A results comparing the spectrograms of the generated samples with different guidance strategies (CFG, PG, and HG) under various text prompts. The third sample is shown with a different time interval than the one presented in the main paper, and they share the same text prompt. (a) HG demonstrates the best generation quality. (b) HG produces the most temporally aligned results…

**Figure 9.** More V2A results comparing the spectrograms of the generated samples with different guidance strategies (CFG and HG) and baselines (Diff-Foley and FoleyCrafter).

Original abstract

The design of diffusion-based audio generation systems has been investigated from diverse perspectives, such as data space, network architecture, and conditioning techniques, while most of these innovations require model re-training. In sampling, classifier-free guidance (CFG) has been uniformly adopted to enhance generation quality by strengthening condition alignment. However, CFG often compromises diversity, resulting in suboptimal performance. Although the recent autoguidance (AG) method proposes another direction of guidance that maintains diversity, its direct application in audio generation has so far underperformed CFG. In this work, we introduce AudioMoG, an improved sampling method that enhances text-to-audio (T2A) and video-to-audio (V2A) generation quality without requiring extensive training resources. We start with an analysis of both CFG and AG, examining their respective advantages and limitations for guiding diffusion models. Building upon our insights, we introduce a mixture-of-guidance framework that integrates diverse guidance signals with their interaction terms (e.g., the unconditional bad version of the model) to maximize cumulative advantages. Experiments show that, given the same inference speed, our approach consistently outperforms single guidance in T2A generation across sampling steps, concurrently showing advantages in V2A, text-to-music, and image generation. Demo samples are available at: https://audiomog.github.io.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

AudioMoG mixes CFG and AG with interaction terms for improved audio diffusion sampling, but lacks detailed experimental validation.

The one or two things to know: AudioMoG is a new sampling method for diffusion audio models that mixes classifier-free guidance, autoguidance, and their interaction terms to get better quality at no extra training cost.

What the paper does well is identify the tradeoffs in current guidance techniques for audio and build a framework that tries to capture the advantages of both by including those extra terms. The reported results show it outperforming the single methods in text-to-audio generation across sampling steps while keeping inference speed the same, with some positive signs in related tasks like video-to-audio and music generation.

The soft spots come down to verification. The central claim rests on empirical outperformance, but the abstract and available information don't include specifics on the datasets used, the evaluation metrics, or statistical significance such as error bars. This makes it hard to assess how general the improvement is. Additionally, the approach introduces free parameters in the mixing coefficients, and it's not obvious from the description whether the interaction terms consistently provide net benefits or whether they can lead to inconsistencies that demand significant per-model adjustment.

This paper targets people in the audio generation community who are looking for ways to enhance existing diffusion models without heavy compute. A reader interested in practical sampling strategies would get value from the idea and the analysis. It has enough substance to deserve a serious referee, who could help clarify the experimental rigor. My recommendation is to send it to peer review rather than desk reject, with the expectation that revisions will address the details on experiments and validation.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes AudioMoG, a mixture-of-guidance sampling method for diffusion-based audio generation. It analyzes classifier-free guidance (CFG) and autoguidance (AG), identifies their respective advantages and limitations, and introduces a framework that combines multiple guidance signals along with interaction terms (explicitly including the unconditional bad model) to improve condition alignment while preserving diversity. The central empirical claim is that, at fixed inference speed, AudioMoG consistently outperforms single-guidance baselines in text-to-audio (T2A) across sampling steps and shows advantages in video-to-audio (V2A), text-to-music, and image generation tasks, all without requiring model retraining.

Significance. If the reported gains are reproducible and general, the work would supply a practical, training-free enhancement to the sampling stage of audio diffusion models. By explicitly incorporating interaction terms rather than treating CFG and AG as mutually exclusive, the approach could help resolve the quality-diversity trade-off that currently limits CFG and the underperformance of direct AG in audio domains. The extension to V2A and cross-modal tasks further suggests broader applicability.

major comments (2)

[Section 4] (Experiments) The abstract and results claim consistent outperformance in T2A across sampling steps, yet no information is provided on the datasets employed, the precise evaluation metrics (e.g., CLAP, FAD, or subjective scores), the number of evaluation samples, or error bars from multiple random seeds. Without these details the statistical reliability of the cross-method comparison cannot be assessed and the central claim remains under-supported.
[Section 3.2] (Mixture-of-Guidance formulation) The mixing rule incorporates interaction terms such as the unconditional bad model, but the manuscript does not specify whether the guidance mixing coefficients are held constant across models and datasets or require per-model/per-dataset calibration. This directly affects the claim that the method avoids extensive tuning resources and raises the possibility that reported gains depend on favorable hyper-parameter choices rather than an intrinsic property of the framework.
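
The error bars requested in the first major comment could be reported as a mean and standard error over random seeds. The following is a minimal sketch using only the Python standard library; the scores are placeholder values chosen for easy checking, not results from the paper, and the method names are hypothetical.

```python
import math
import statistics


def mean_and_sem(scores):
    """Return (mean, standard error of the mean) for per-seed scores."""
    if len(scores) < 2:
        raise ValueError("need at least two seeds to estimate spread")
    mean = statistics.fmean(scores)
    sem = statistics.stdev(scores) / math.sqrt(len(scores))
    return mean, sem


# Placeholder per-seed scores (made up for illustration).
placeholder_scores = {
    "baseline": [1.0, 2.0, 3.0],
    "candidate": [0.5, 1.0, 1.5],
}

summary = {name: mean_and_sem(vals) for name, vals in placeholder_scores.items()}
for name, (mean, sem) in summary.items():
    print(f"{name}: {mean:.3f} ± {sem:.3f}")
```

Reporting the number of seeds alongside these values lets a reader judge whether a gap between methods exceeds the seed-to-seed spread.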

minor comments (2)

[Abstract] The term 'unconditional bad version of the model' appears in the abstract and method description without an immediate forward reference to its precise definition or computation; adding a brief parenthetical or citation to the relevant equation would improve readability.
[Figures] Figure captions and axis labels in the experimental plots should explicitly state the guidance scales and mixing weights used for each curve to allow direct replication of the reported comparisons.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback, which helps improve the clarity and reproducibility of our work. We provide point-by-point responses to the major comments below and have incorporated revisions to address the concerns raised.

Referee: [Section 4] (Experiments) The abstract and results claim consistent outperformance in T2A across sampling steps, yet no information is provided on the datasets employed, the precise evaluation metrics (e.g., CLAP, FAD, or subjective scores), the number of evaluation samples, or error bars from multiple random seeds. Without these details the statistical reliability of the cross-method comparison cannot be assessed and the central claim remains under-supported.

Authors: We agree that the initial manuscript lacked sufficient experimental details to fully substantiate the claims. In the revised version, we will expand Section 4 to explicitly describe the datasets (e.g., AudioCaps for T2A evaluations), the precise metrics including CLAP for semantic alignment, FAD for perceptual quality, and any subjective scores used. We will also report the number of evaluation samples and include error bars derived from multiple random seeds to enable assessment of statistical reliability. These additions directly strengthen support for the consistent outperformance results across sampling steps.

Revision: yes

Referee: [Section 3.2] (Mixture-of-Guidance formulation) The mixing rule incorporates interaction terms such as the unconditional bad model, but the manuscript does not specify whether the guidance mixing coefficients are held constant across models and datasets or require per-model/per-dataset calibration. This directly affects the claim that the method avoids extensive tuning resources and raises the possibility that reported gains depend on favorable hyper-parameter choices rather than an intrinsic property of the framework.

Authors: The mixing coefficients in AudioMoG are designed to be held constant and were applied uniformly without per-model or per-dataset recalibration in all reported experiments, including T2A, V2A, text-to-music, and image generation tasks. We will revise Section 3.2 to explicitly state this fixed-coefficient approach and note the specific values used, thereby reinforcing that the method requires no extensive tuning resources beyond the standard sampling procedure. This clarification addresses the concern that gains might stem from favorable hyper-parameter choices.

Revision: yes

Circularity Check

0 steps flagged

No significant circularity; AudioMoG framework is empirically validated rather than self-referential

The paper begins with an analysis of existing CFG and AG techniques, identifies their respective strengths and weaknesses for audio generation, and then proposes a new mixture-of-guidance construction that combines signals and interaction terms. The claimed outperformance is presented as an experimental result at fixed inference speed across sampling steps, not as a mathematical identity or fitted quantity that reduces to the inputs by definition. No equations or steps equate the mixture result to a self-defined quantity, and no load-bearing premise relies on a self-citation chain or imported uniqueness theorem. The derivation remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on an empirical combination of existing guidance methods; limited free parameters likely include mixing coefficients chosen to optimize results.

free parameters (1)

guidance mixing coefficients
Weights used to combine CFG, AG, and interaction signals; chosen or tuned to achieve reported performance gains.

axioms (1)

Domain assumption: interaction terms from unconditional model versions provide beneficial guidance signals that can be additively combined.
Invoked when describing the mixture framework that integrates diverse signals to maximize advantages.
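
Written out, the additivity assumption amounts to a guided prediction of the following generic form. The notation is assumed here for illustration ($D$ for the main model's prediction, $D_{\mathrm{bad}}$ for the weaker model, $c$ for the condition, $\varnothing$ for the null condition) and is not taken from the paper's equations.

```latex
\tilde{D}(x; c) = D(x; c) + \sum_{i} w_i \, \Delta_i(x; c),
\qquad
\Delta_{\mathrm{CFG}} = D(x; c) - D(x; \varnothing),
\quad
\Delta_{\mathrm{AG}} = D(x; c) - D_{\mathrm{bad}}(x; c).
```

Under this reading, any interaction term would enter as a further direction $\Delta_i$ with its own weight $w_i$.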

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

Relation between the paper passage and the cited Recognition theorem: unclear.

AudioMoG ... mixture-of-guidance framework that integrates diverse guidance signals with their interaction terms (e.g., the unconditional bad version of the model)

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

79 extracted references · 79 canonical work pages · 6 internal anchors

[1] Andrea Agostinelli, Timo I Denk, Zalán Borsos, Jesse Engel, Mauro Verzetti, Antoine Caillon, Qingqing Huang, Aren Jansen, Adam Roberts, Marco Tagliasacchi, et al. MusicLM: Generating music from text. arXiv preprint arXiv:2301.11325, 2023.
[2] Thierry Bertin-Mahieux, Daniel PW Ellis, Brian Whitman, and Paul Lamere. The million song dataset. In ISMIR, volume 2, p. 10, 2011.
[3] Arwen Bradley and Preetum Nakkiran. Classifier-free guidance is a predictor-corrector. arXiv preprint arXiv:2408.09000, 2024.
[4] Honglie Chen, Weidi Xie, Andrea Vedaldi, and Andrew Zisserman. VGGSound: A large-scale audio-visual dataset. In ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 721–725. IEEE, 2020.
[5] Ho Kei Cheng, Masato Ishii, Akio Hayakawa, Takashi Shibuya, Alexander Schwing, and Yuki Mitsufuji. MMAudio: Taming multimodal joint training for high-quality video-to-audio synthesis. In CVPR, pp. 28901–28911, 2025.
[6] Muthu Chidambaram, Khashayar Gatmiry, Sitan Chen, Holden Lee, and Jianfeng Lu. What does guidance do? A fine-grained analysis in a simple setting. arXiv preprint arXiv:2409.13074, 2024.
[7] Hyung Won Chung, Le Hou, Shayne Longpre, Barret Zoph, Yi Tay, William Fedus, Yunxuan Li, Xuezhi Wang, Mostafa Dehghani, Siddhartha Brahma, et al. Scaling instruction-finetuned language models. Journal of Machine Learning Research, 25(70):1–53, 2024.
[8] Marco Comunità, Riccardo F Gramaccioni, Emilian Postolache, Emanuele Rodolà, Danilo Comminiello, and Joshua D Reiss. SyncFusion: Multimodal onset-synchronized video-to-audio foley synthesis. In ICASSP, pp. 936–940. IEEE, 2024.
[9] Jade Copet, Felix Kreuk, Itai Gat, Tal Remez, David Kant, Gabriel Synnaeve, Yossi Adi, and Alexandre Défossez. Simple and controllable music generation. Advances in Neural Information Processing Systems, 36:47704–47720, 2023.
[10] Deepanway Ghosal, Navonil Majumder, Ambuj Mehrish, and Soujanya Poria. Text-to-audio generation using instruction-tuned LLM and latent diffusion model. arXiv preprint arXiv:2304.13731, 2023.
[11] Michaël Defferrard, Kirell Benzi, Pierre Vandergheynst, and Xavier Bresson. FMA: A dataset for music analysis. arXiv preprint arXiv:1612.01840, 2016.
[12] Konstantinos Drossos, Samuel Lipping, and Tuomas Virtanen. Clotho: An audio captioning dataset. In ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 736–740. IEEE, 2020.
[13] Yuexi Du, Ziyang Chen, Justin Salamon, Bryan Russell, and Andrew Owens. Conditional generation of audio from video via foley analogies. In Conference on Computer Vision and Pattern Recognition 2023, 2023.
[14] Zach Evans, CJ Carr, Josiah Taylor, Scott H Hawley, and Jordi Pons. Fast timing-conditioned latent audio diffusion. In Forty-first International Conference on Machine Learning, 2024.
[15] Zach Evans, Julian D Parker, CJ Carr, Zack Zukowski, Josiah Taylor, and Jordi Pons. Stable audio open. In ICASSP 2025 - 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 1–5. IEEE, 2025.
[16] Eduardo Fonseca, Xavier Favory, Jordi Pons, Frederic Font, and Xavier Serra. FSD50K: An open dataset of human-labeled sound events. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 30:829–852, 2021.
[17] Seth Forsgren and Hayk Martiros. Riffusion - Stable diffusion for real-time music generation. URL https://riffusion.com, 2022.
[18] Jort F Gemmeke, Daniel PW Ellis, Dylan Freedman, Aren Jansen, Wade Lawrence, R Channing Moore, Manoj Plakal, and Marvin Ritter. Audio Set: An ontology and human-labeled dataset for audio events. In 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 776–780. IEEE, 2017.
[19] Rohit Girdhar, Alaaeldin El-Nouby, Zhuang Liu, Mannat Singh, Kalyan Vasudev Alwala, Armand Joulin, and Ishan Misra. ImageBind: One embedding space to bind them all. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15180–15190, 2023.
[20] Shawn Hershey, Sourish Chaudhuri, Daniel PW Ellis, Jort F. Gemmeke, Aren Jansen, R. Channing Moore, Manoj Plakal, Devin Platt, Rif A. Saurous, Bryan Seybold, Malcolm Slaney, Ron J. Weiss, and Kevin Wilson. CNN architectures for large-scale audio classification. In Proc. ICASSP, pp. 131–135, 2017.
[21] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. In Proc. NeurIPS, pp. 6626–6637, 2017.
[22] Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance. arXiv preprint arXiv:2207.12598, 2022.
[23] Susung Hong, Gyuseong Lee, Wooseok Jang, and Seungryong Kim. Improving sample quality of diffusion models using self-attention guidance. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 7462–7471, 2023.
[24] Jiawei Huang, Yi Ren, Rongjie Huang, Dongchao Yang, Zhenhui Ye, Chen Zhang, Jinglin Liu, Xiang Yin, Zejun Ma, and Zhou Zhao. Make-An-Audio 2: Temporal-enhanced text-to-audio generation. arXiv preprint arXiv:2305.18474, 2023a.
[25] Po-Yao Huang, Hu Xu, Juncheng Li, Alexei Baevski, Michael Auli, Wojciech Galuba, Florian Metze, and Christoph Feichtenhofer. Masked autoencoders that listen. Advances in Neural Information Processing Systems, 35:28708–28720, 2022.
[26] Rongjie Huang, Jiawei Huang, Dongchao Yang, Yi Ren, Luping Liu, Mingze Li, Zhenhui Ye, Jinglin Liu, Xiang Yin, and Zhou Zhao. Make-An-Audio: Text-to-audio generation with prompt-enhanced diffusion models. In International Conference on Machine Learning, pp. 13916–13932. PMLR, 2023b.
[27] Chia-Yu Hung, Navonil Majumder, Zhifeng Kong, Ambuj Mehrish, Rafael Valle, Bryan Catanzaro, and Soujanya Poria. TangoFlux: Super fast and faithful text to audio generation with flow matching and CLAP-ranked preference optimization. arXiv preprint arXiv:2412.21037, 2024.
[28] Junha Hyung, Kinam Kim, Susung Hong, Min-Jung Kim, and Jaegul Choo. Spatiotemporal skip guidance for enhanced video diffusion sampling. In CVPR, 2025.
[29] Boseong Jeon. SPG: Improving motion diffusion by smooth perturbation guidance. arXiv preprint arXiv:2503.02577, 2025.
[30] Yujin Jeong, Yunji Kim, Sanghyuk Chun, and Jiyoung Lee. Read, watch and scream! Sound generation from text and video. arXiv preprint arXiv:2407.05551, 2024.
[31] Yuxuan Jiang, Zehua Chen, Zeqian Ju, Chang Li, Weibei Dou, and Jun Zhu. FreeAudio: Training-free timing planning for controllable long-form text-to-audio generation. arXiv preprint arXiv:2507.08557, 2025.
[32] Tero Karras, Miika Aittala, Tuomas Kynkäänniemi, Jaakko Lehtinen, Timo Aila, and Samuli Laine. Guiding a diffusion model with a bad version of itself. Advances in Neural Information Processing Systems, 37:52996–53021, 2024a.
[33] Tero Karras, Miika Aittala, Jaakko Lehtinen, Janne Hellsten, Timo Aila, and Samuli Laine. Analyzing and improving the training dynamics of diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 24174–24184, 2024b.
[34] Artur Kasymov, Marcin Sendera, Michal Stypulkowski, Maciej Zieba, and Przemysław Spurek. AutoLoRA: Autoguidance meets low-rank adaptation for diffusion models. ArXiv, abs/2410.03941, 2024. URL https://api.semanticscholar.org/CorpusID:273186941.
[35] Kevin Kilgour, Robin Clark, Kyu J. Sim, and Paris Smaragdis. Fréchet audio distance: A metric for evaluating music enhancement algorithms. In Proc. Interspeech, pp. 2350–2354, 2019.
[36] Chris Dongjoo Kim, Byeongchang Kim, Hyunmin Lee, and Gunhee Kim. AudioCaps: Generating captions for audios in the wild. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 119–132, 2019.
[37] Diederik P Kingma, Max Welling, et al. Auto-encoding variational Bayes, 2013.
[38] Qiuqiang Kong, Yin Cao, Turab Iqbal, Yuxuan Wang, Wenwu Wang, and Mark D. Plumbley. PANNs: Large-scale pretrained audio neural networks for audio pattern recognition. In IEEE/ACM Transactions on Audio, Speech, and Language Processing, volume 28, pp. 2880–2894, 2020.
[39] Zhifeng Kong, Sang-gil Lee, Deepanway Ghosal, Navonil Majumder, Ambuj Mehrish, Rafael Valle, Soujanya Poria, and Bryan Catanzaro. Improving text-to-audio models with synthetic captions. arXiv preprint arXiv:2406.15487, 2024.
[40] Felix Kreuk, Gabriel Synnaeve, Adam Polyak, Uriel Singer, Alexandre Défossez, Jade Copet, Devi Parikh, Yaniv Taigman, and Yossi Adi. AudioGen: Textually guided audio generation. arXiv preprint arXiv:2209.15352, 2022.
[41]

Solomon Kullback and Richard A. Leibler. On information and sufficiency. The Annals of Mathematical Statistics, 22 0 (1): 0 79--86, 1951

work page 1951
[42]

Efficient neural music generation

Max WY Lam, Qiao Tian, Tang Li, Zongyu Yin, Siyuan Feng, Ming Tu, Yuliang Ji, Rui Xia, Mingbo Ma, Xuchen Song, et al. Efficient neural music generation. Advances in Neural Information Processing Systems, 36: 0 17450--17463, 2023

[43]

Evaluation of algorithms using games: The case of music tagging

Edith Law, Kris West, Michael I Mandel, Mert Bay, and J Stephen Downie. Evaluation of algorithms using games: The case of music tagging. In ISMIR, pp. 387–392. Citeseer, 2009

[44]

Etta: Elucidating the design space of text-to-audio models

Sang-gil Lee, Zhifeng Kong, Arushi Goel, Sungwon Kim, Rafael Valle, and Bryan Catanzaro. Etta: Elucidating the design space of text-to-audio models. In ICML, 2025

[45]

Quality-aware masked diffusion transformer for enhanced music generation

Chang Li, Ruoyu Wang, Lijuan Liu, Jun Du, Yixuan Sun, Zilu Guo, Zhenrong Zhang, and Yuan Jiang. Quality-aware masked diffusion transformer for enhanced music generation. arXiv e-prints, pp. arXiv–2405, 2024a

[46]

Jen-1: Text-guided universal music generation with omnidirectional diffusion models

Peike Patrick Li, Boyu Chen, Yao Yao, Yikai Wang, Allen Wang, and Alex Wang. Jen-1: Text-guided universal music generation with omnidirectional diffusion models. In 2024 IEEE Conference on Artificial Intelligence (CAI), pp. 762–769. IEEE, 2024b

[47]

Self-guidance: Boosting flow and diffusion generation on their own

Tiancheng Li, Weijian Luo, Zhiyang Chen, Liyuan Ma, and Guo-Jun Qi. Self-guidance: Boosting flow and diffusion generation on their own. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2025

[48]

Audioldm: Text-to-audio generation with latent diffusion models

Haohe Liu, Zehua Chen, Zejia Yuan, et al. Audioldm: Text-to-audio generation with latent diffusion models. arXiv preprint arXiv:2301.12503, 2023

[49]

Audioldm 2: Learning holistic audio generation with self-supervised pretraining

Haohe Liu, Yi Yuan, Xubo Liu, Xinhao Mei, Qiuqiang Kong, Qiao Tian, Yuping Wang, Wenwu Wang, Yuxuan Wang, and Mark D Plumbley. Audioldm 2: Learning holistic audio generation with self-supervised pretraining. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2024a

[50]

Tell what you hear from what you see - video to audio generation through text

Xiulong Liu, Kun Su, and Eli Shlizerman. Tell what you hear from what you see - video to audio generation through text. In NeurIPS, 2024b. URL https://openreview.net/forum?id=kr7eN85mIT

[51]

Decoupled Weight Decay Regularization

Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017

[52]

DPM-Solver++: Fast Solver for Guided Sampling of Diffusion Probabilistic Models

Cheng Lu, Yuhao Zhou, Fan Bao, Jianfei Chen, Chongxuan Li, and Jun Zhu. Dpm-solver++: Fast solver for guided sampling of diffusion probabilistic models. arXiv preprint arXiv:2211.01095, 2022

[53]

Diff-foley: Synchronized video-to-audio synthesis with latent diffusion models

Simian Luo, Chuanhao Yan, Chenxu Hu, and Hang Zhao. Diff-foley: Synchronized video-to-audio synthesis with latent diffusion models. Advances in Neural Information Processing Systems, 36: 48855–48876, 2023

[54]

Tango 2: Aligning diffusion-based text-to-audio generations through direct preference optimization

Navonil Majumder, Chia-Yu Hung, Deepanway Ghosal, Wei-Ning Hsu, Rada Mihalcea, and Soujanya Poria. Tango 2: Aligning diffusion-based text-to-audio generations through direct preference optimization. In Proceedings of the 32nd ACM International Conference on Multimedia, pp. 564–572, 2024

[55]

Foleygen: Visually-guided audio generation

Xinhao Mei, Varun Nagaraja, Gael Le Lan, Zhaoheng Ni, Ernie Chang, Yangyang Shi, and Vikas Chandra. Foleygen: Visually-guided audio generation. In 2024 IEEE 34th International Workshop on Machine Learning for Signal Processing (MLSP), pp. 1–6. IEEE, 2024

[56]

Mubert-Inc. Mubert. URL https://mubert.com/, https://github.com/MubertAI/Mubert-Text-to-Music, 2022

[57]

Scalable diffusion models with transformers

William Peebles and Saining Xie. Scalable diffusion models with transformers. In Proceedings of the IEEE/CVF international conference on computer vision, pp. 4195–4205, 2023

[58]

Unconditional priors matter! improving conditional generation of fine-tuned diffusion models

Prin Phunyaphibarn, Phillip Y Lee, Jaihoon Kim, and Minhyuk Sung. Unconditional priors matter! improving conditional generation of fine-tuned diffusion models. arXiv preprint arXiv:2503.20240, 2025

[59]

Learning Transferable Visual Models From Natural Language Supervision

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable visual models from natural language supervision, 2021. URL https://arxiv.org/abs/2103.00020

[60]

Direct preference optimization: Your language model is secretly a reward model

Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D Manning, Stefano Ermon, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model. Advances in Neural Information Processing Systems, 36: 53728–53741, 2023

[61]

High-resolution image synthesis with latent diffusion models

Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 10684–10695, 2022

[62]

No training, no problem: Rethinking classifier-free guidance for diffusion models

Seyedmorteza Sadat, Manuel Kansy, Otmar Hilliges, and Romann M Weber. No training, no problem: Rethinking classifier-free guidance for diffusion models. arXiv preprint arXiv:2407.02687, 2024

[63]

Improved techniques for training gans

Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. Improved techniques for training gans. In Proc. NeurIPS, pp. 2234–2242, 2016

[64]

Moûsai: Text-to-music generation with long-context latent diffusion

Flavio Schneider, Ojasv Kamal, Zhijing Jin, and Bernhard Schölkopf. Moûsai: Text-to-music generation with long-context latent diffusion. arXiv preprint arXiv:2301.11757, 2023

[65]

I hear your true colors: Image guided audio generation

Roy Sheffer and Yossi Adi. I hear your true colors: Image guided audio generation. In ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 1–5. IEEE, 2023

[66]

From vision to audio and beyond: A unified model for audio-visual representation and generation

Kun Su, Xiulong Liu, and Eli Shlizerman. From vision to audio and beyond: A unified model for audio-visual representation and generation. arXiv preprint arXiv:2409.19132, 2024

[67]

Neural discrete representation learning

Aaron Van Den Oord, Oriol Vinyals, et al. Neural discrete representation learning. Advances in neural information processing systems, 30, 2017

[68]

V2a-mapper: A lightweight solution for vision-to-audio generation by connecting foundation models

Heng Wang, Jianbo Ma, Santiago Pascual, Richard Cartwright, and Weidong Cai. V2a-mapper: A lightweight solution for vision-to-audio generation by connecting foundation models. In AAAI, volume 38, pp. 15492–15501, 2024a

[69]

Tiva: Time-aligned video-to-audio generation

Xihua Wang, Yuyue Wang, Yihan Wu, Ruihua Song, Xu Tan, Zehua Chen, Hongteng Xu, and Guodong Sui. Tiva: Time-aligned video-to-audio generation. In ACM Multimedia, 2024b

[70]

Large-scale contrastive language-audio pretraining with feature fusion and keyword-to-caption augmentation

Yusong Wu, Ke Chen, Tianyu Zhang, Yuchen Hui, Taylor Berg-Kirkpatrick, and Shlomo Dubnov. Large-scale contrastive language-audio pretraining with feature fusion and keyword-to-caption augmentation. In ICASSP, pp. 1–5. IEEE, 2023

[71]

Sonicvisionlm: Playing sound with vision language models

Zhifeng Xie, Shengye Yu, Qile He, and Mengtian Li. Sonicvisionlm: Playing sound with vision language models. In CVPR, pp. 26866–26875, 2024

[72]

Video-to-audio generation with hidden alignment

Manjie Xu, Chenxing Li, Yong Ren, Rilin Chen, Yu Gu, Wei Liang, and Dong Yu. Video-to-audio generation with hidden alignment. arXiv preprint arXiv:2407.07464, 2024

[73]

Diffsound: Discrete diffusion model for text-to-sound generation

Dongchao Yang, Jianwei Yu, Helin Wang, Wen Wang, Chao Weng, Yuexian Zou, and Dong Yu. Diffsound: Discrete diffusion model for text-to-sound generation. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 31: 1720–1733, 2023

[74]

Foleycrafter: Bring silent videos to life with lifelike and synchronized sounds

Yiming Zhang, Yicheng Gu, Yanhong Zeng, Zhening Xing, Yuancheng Wang, Zhizheng Wu, and Kai Chen. Foleycrafter: Bring silent videos to life with lifelike and synchronized sounds. arXiv preprint arXiv:2407.01494, 2024

[75]

Domain guidance: A simple transfer approach for a pre-trained diffusion model

Jincheng Zhong, Xiangcheng Zhang, Jianmin Wang, and Mingsheng Long. Domain guidance: A simple transfer approach for a pre-trained diffusion model. arXiv preprint arXiv:2504.01521, 2025


[35] [35]

Sim, and Paris Smaragdis

Kevin Kilgour, Robin Clark, Kyu J. Sim, and Paris Smaragdis. Fréchet audio distance: A metric for evaluating music enhancement algorithms. In Proc. Interspeech, pp.\ 2350--2354, 2019

work page 2019

[36] [36]

Audiocaps: Generating captions for audios in the wild

Chris Dongjoo Kim, Byeongchang Kim, Hyunmin Lee, and Gunhee Kim. Audiocaps: Generating captions for audios in the wild. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp.\ 119--132, 2019

work page 2019

[37] [37]

Auto-encoding variational bayes, 2013

Diederik P Kingma, Max Welling, et al. Auto-encoding variational bayes, 2013

work page 2013

[38] [38]

Plumbley

Qiuqiang Kong, Yin Cao, Turab Iqbal, Yuxuan Wang, Wenwu Wang, and Mark D. Plumbley. Panns: Large-scale pretrained audio neural networks for audio pattern recognition. In IEEE/ACM Transactions on Audio, Speech, and Language Processing, volume 28, pp.\ 2880--2894, 2020

work page 2020

[39] [39]

Improving text-to-audio models with synthetic captions

Zhifeng Kong, Sang-gil Lee, Deepanway Ghosal, Navonil Majumder, Ambuj Mehrish, Rafael Valle, Soujanya Poria, and Bryan Catanzaro. Improving text-to-audio models with synthetic captions. arXiv preprint arXiv:2406.15487, 2024

work page arXiv 2024

[40] [40]

Audiogen: Textually guided audio generation.arXiv preprint arXiv:2209.15352,

Felix Kreuk, Gabriel Synnaeve, Adam Polyak, Uriel Singer, Alexandre D \'e fossez, Jade Copet, Devi Parikh, Yaniv Taigman, and Yossi Adi. Audiogen: Textually guided audio generation. arXiv preprint arXiv:2209.15352, 2022

work page arXiv 2022

[41] [41]

Solomon Kullback and Richard A. Leibler. On information and sufficiency. The Annals of Mathematical Statistics, 22 0 (1): 0 79--86, 1951

work page 1951

[42] [42]

Efficient neural music generation

Max WY Lam, Qiao Tian, Tang Li, Zongyu Yin, Siyuan Feng, Ming Tu, Yuliang Ji, Rui Xia, Mingbo Ma, Xuchen Song, et al. Efficient neural music generation. Advances in Neural Information Processing Systems, 36: 0 17450--17463, 2023

work page 2023

[43] [43]

Evaluation of algorithms using games: The case of music tagging

Edith Law, Kris West, Michael I Mandel, Mert Bay, and J Stephen Downie. Evaluation of algorithms using games: The case of music tagging. In ISMIR, pp.\ 387--392. Citeseer, 2009

work page 2009

[44] [44]

Etta: Elucidating the design space of text-to-audio models

Sang-gil Lee, Zhifeng Kong, Arushi Goel, Sungwon Kim, Rafael Valle, and Bryan Catanzaro. Etta: Elucidating the design space of text-to-audio models. In ICML, 2025

work page 2025

[45] [45]

Quality-aware masked diffusion transformer for enhanced music generation

Chang Li, Ruoyu Wang, Lijuan Liu, Jun Du, Yixuan Sun, Zilu Guo, Zhenrong Zhang, and Yuan Jiang. Quality-aware masked diffusion transformer for enhanced music generation. arXiv e-prints, pp.\ arXiv--2405, 2024 a

work page 2024

[46] [46]

Jen-1: Text-guided universal music generation with omnidirectional diffusion models

Peike Patrick Li, Boyu Chen, Yao Yao, Yikai Wang, Allen Wang, and Alex Wang. Jen-1: Text-guided universal music generation with omnidirectional diffusion models. In 2024 IEEE Conference on Artificial Intelligence (CAI), pp.\ 762--769. IEEE, 2024 b

work page 2024

[47] [47]

Self-guidance: Boosting flow and diffusion generation on their own

Tiancheng Li, Weijian Luo, Zhiyang Chen, Liyuan Ma, and Guo-Jun Qi. Self-guidance: Boosting flow and diffusion generation on their own. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2025

work page 2025

[48] [48]

Audioldm: Text-to-audio generation with latent diffusion models.arXiv preprint arXiv:2301.12503,

Haohe Liu, Zehua Chen, Zejia Yuan, et al. Audioldm: Text-to-audio generation with latent diffusion models. arXiv preprint arXiv:2301.12503, 2023

work page arXiv 2023

[49] [49]

Audioldm 2: Learning holistic audio generation with self-supervised pretraining

Haohe Liu, Yi Yuan, Xubo Liu, Xinhao Mei, Qiuqiang Kong, Qiao Tian, Yuping Wang, Wenwu Wang, Yuxuan Wang, and Mark D Plumbley. Audioldm 2: Learning holistic audio generation with self-supervised pretraining. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2024 a

work page 2024

[50] [50]

Tell what you hear from what you see - video to audio generation through text

Xiulong Liu, Kun Su, and Eli Shlizerman. Tell what you hear from what you see - video to audio generation through text. In NeurIPS, 2024 b . URL https://openreview.net/forum?id=kr7eN85mIT

work page 2024

[51] [51]

Decoupled Weight Decay Regularization

Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017

[52] [52]

DPM-Solver++: Fast Solver for Guided Sampling of Diffusion Probabilistic Models

Cheng Lu, Yuhao Zhou, Fan Bao, Jianfei Chen, Chongxuan Li, and Jun Zhu. Dpm-solver++: Fast solver for guided sampling of diffusion probabilistic models. arXiv preprint arXiv:2211.01095, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022

[53] [53]

Diff-foley: Synchronized video-to-audio synthesis with latent diffusion models

Simian Luo, Chuanhao Yan, Chenxu Hu, and Hang Zhao. Diff-foley: Synchronized video-to-audio synthesis with latent diffusion models. Advances in Neural Information Processing Systems, 36: 0 48855--48876, 2023

work page 2023

[54] [54]

Tango 2: Aligning diffusion-based text-to-audio generations through direct preference optimization

Navonil Majumder, Chia-Yu Hung, Deepanway Ghosal, Wei-Ning Hsu, Rada Mihalcea, and Soujanya Poria. Tango 2: Aligning diffusion-based text-to-audio generations through direct preference optimization. In Proceedings of the 32nd ACM International Conference on Multimedia, pp.\ 564--572, 2024

work page 2024

[55] [55]

Foleygen: Visually-guided audio generation

Xinhao Mei, Varun Nagaraja, Gael Le Lan, Zhaoheng Ni, Ernie Chang, Yangyang Shi, and Vikas Chandra. Foleygen: Visually-guided audio generation. In 2024 IEEE 34th International Workshop on Machine Learning for Signal Processing (MLSP), pp.\ 1--6. IEEE, 2024

work page 2024

[56] [56]

Mubert-Inc. Mubert. URL https://mubert.com/, https://github.com/MubertAI/ Mubert-Text-to-Music, 2022

work page 2022

[57] [57]

Scalable diffusion models with transformers

William Peebles and Saining Xie. Scalable diffusion models with transformers. In Proceedings of the IEEE/CVF international conference on computer vision, pp.\ 4195--4205, 2023

work page 2023

[58] [58]

Unconditional priors matter! improving conditional generation of fine-tuned diffusion models

Prin Phunyaphibarn, Phillip Y Lee, Jaihoon Kim, and Minhyuk Sung. Unconditional priors matter! improving conditional generation of fine-tuned diffusion models. arXiv preprint arXiv:2503.20240, 2025

work page arXiv 2025

[59] [59]

Learning Transferable Visual Models From Natural Language Supervision

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable visual models from natural language supervision, 2021. URL https://arxiv.org/abs/2103.00020

work page internal anchor Pith review Pith/arXiv arXiv 2021

[60] [60]

Direct preference optimization: Your language model is secretly a reward model

Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D Manning, Stefano Ermon, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model. Advances in Neural Information Processing Systems, 36: 0 53728--53741, 2023

work page 2023

[61] [61]

High-resolution image synthesis with latent diffusion models

Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Bj \"o rn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp.\ 10684--10695, 2022

work page 2022

[62] [62]

arXiv preprint arXiv:2407.02687 , year =

Seyedmorteza Sadat, Manuel Kansy, Otmar Hilliges, and Romann M Weber. No training, no problem: Rethinking classifier-free guidance for diffusion models. arXiv preprint arXiv:2407.02687, 2024

work page arXiv 2024

[63] [63]

Improved techniques for training gans

Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. Improved techniques for training gans. In Proc. NeurIPS, pp.\ 2234--2242, 2016

work page 2016

[64] [64]

Mo \^ usai: Text-to-music generation with long-context latent diffusion

Flavio Schneider, Ojasv Kamal, Zhijing Jin, and Bernhard Sch \"o lkopf. Mo \^ usai: Text-to-music generation with long-context latent diffusion. arXiv preprint arXiv:2301.11757, 2023

work page arXiv 2023

[65] [65]

I hear your true colors: Image guided audio generation

Roy Sheffer and Yossi Adi. I hear your true colors: Image guided audio generation. In ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp.\ 1--5. IEEE, 2023

work page 2023

[66] [66]

From vision to audio and beyond: A unified model for audio-visual representation and generation

Kun Su, Xiulong Liu, and Eli Shlizerman. From vision to audio and beyond: A unified model for audio-visual representation and generation. arXiv preprint arXiv:2409.19132, 2024

work page arXiv 2024

[67] [67]

Neural discrete representation learning

Aaron Van Den Oord, Oriol Vinyals, et al. Neural discrete representation learning. Advances in neural information processing systems, 30, 2017

work page 2017

[68] [68]

V2a-mapper: A lightweight solution for vision-to-audio generation by connecting foundation models

Heng Wang, Jianbo Ma, Santiago Pascual, Richard Cartwright, and Weidong Cai. V2a-mapper: A lightweight solution for vision-to-audio generation by connecting foundation models. In AAAI, volume 38, pp.\ 15492--15501, 2024 a

work page 2024

[69] [69]

Tiva: Time-aligned video-to-audio generation

Xihua Wang, Yuyue Wang, Yihan Wu, Ruihua Song, Xu Tan, Zehua Chen, Hongteng Xu, and Guodong Sui. Tiva: Time-aligned video-to-audio generation. In ACM Multimedia, 2024 b

work page 2024

[70] [70]

Large-scale contrastive language-audio pretraining with feature fusion and keyword-to-caption augmentation

Yusong Wu, Ke Chen, Tianyu Zhang, Yuchen Hui, Taylor Berg-Kirkpatrick, and Shlomo Dubnov. Large-scale contrastive language-audio pretraining with feature fusion and keyword-to-caption augmentation. In ICASSP, pp. 1–5. IEEE, 2023

work page 2023

[71] [71]

Sonicvisionlm: Playing sound with vision language models

Zhifeng Xie, Shengye Yu, Qile He, and Mengtian Li. Sonicvisionlm: Playing sound with vision language models. In CVPR, pp. 26866–26875, 2024

work page 2024

[72] [72]

Video-to-audio generation with hidden alignment

Manjie Xu, Chenxing Li, Yong Ren, Rilin Chen, Yu Gu, Wei Liang, and Dong Yu. Video-to-audio generation with hidden alignment. arXiv preprint arXiv:2407.07464, 2024

work page arXiv 2024

[73] [73]

Diffsound: Discrete diffusion model for text-to-sound generation

Dongchao Yang, Jianwei Yu, Helin Wang, Wen Wang, Chao Weng, Yuexian Zou, and Dong Yu. Diffsound: Discrete diffusion model for text-to-sound generation. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 31:1720–1733, 2023

work page 2023

[74] [74]

Foleycrafter: Bring silent videos to life with lifelike and synchronized sounds

Yiming Zhang, Yicheng Gu, Yanhong Zeng, Zhening Xing, Yuancheng Wang, Zhizheng Wu, and Kai Chen. Foleycrafter: Bring silent videos to life with lifelike and synchronized sounds. arXiv preprint arXiv:2407.01494, 2024

work page arXiv 2024

[75] [75]

Domain guidance: A simple transfer approach for a pre-trained diffusion model

Jincheng Zhong, Xiangcheng Zhang, Jianmin Wang, and Mingsheng Long. Domain guidance: A simple transfer approach for a pre-trained diffusion model. arXiv preprint arXiv:2504.01521, 2025

work page arXiv 2025
