pith. sign in

arxiv: 2509.23727 · v2 · submitted 2025-09-28 · 💻 cs.SD · cs.AI

AudioMoG: Guiding Audio Generation with Mixture-of-Guidance

Pith reviewed 2026-05-18 13:05 UTC · model grok-4.3

classification 💻 cs.SD cs.AI
keywords text-to-audiodiffusion modelsclassifier-free guidancesampling methodsaudio generationvideo-to-audiomixture of guidance
0
0 comments X

The pith

A mixture-of-guidance sampling method outperforms single-guidance techniques in diffusion-based audio generation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper aims to establish that blending multiple guidance signals during sampling produces higher quality outputs from diffusion models for audio than relying on any one guidance alone. Classifier-free guidance strengthens how closely the result matches a text or video prompt but tends to reduce variety in the sounds produced. Autoguidance tries to preserve that variety yet has not matched the performance of the first method in audio tasks so far. By combining the signals along with terms that reflect their differences, such as outputs from models without any conditioning, the new approach seeks to collect the best traits of each without adding training steps or slowing the process down. If the claim holds, creators could generate more faithful and varied audio for videos, music, or other media at the same computational cost.

Core claim

By integrating diverse guidance signals with their interaction terms in a mixture-of-guidance framework, the method delivers better generation quality than single-guidance baselines in text-to-audio tasks across sampling steps while also showing gains in video-to-audio, text-to-music, and image generation at identical inference speeds.

What carries the argument

Mixture-of-guidance framework that combines classifier-free guidance, autoguidance, and interaction terms such as unconditional model outputs to accumulate their respective advantages during the sampling process.

If this is right

  • The method produces higher quality text-to-audio results than single-guidance approaches at the same inference speed across different sampling steps.
  • Performance advantages appear in video-to-audio generation as well.
  • The benefits carry over to text-to-music and image generation tasks.
  • No model retraining is required to obtain the improvements.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same mixing strategy could be tested on video or 3D generation models to check whether the quality-diversity balance improves in those domains too.
  • One could measure whether the optimal mix of guidance weights stays stable when the underlying diffusion model changes size or training data.
  • Future work might examine if the interaction terms reduce the amount of manual tuning needed when moving the method to new conditioning types such as audio captions.
  • This framework suggests a path toward adaptive sampling that selects or weights guidance signals on the fly based on intermediate generation quality.

Load-bearing premise

Combining the different guidance signals and their interaction terms will add up their benefits without creating inconsistencies that demand heavy per-model adjustments.

What would settle it

Running the mixture-of-guidance sampler on a standard text-to-audio diffusion model and measuring no improvement in objective quality metrics over the strongest single-guidance baseline when both use the same number of sampling steps.

Figures

Figures reproduced from arXiv: 2509.23727 by Binjie Yuan, Chang Li, Junyou Wang, Jun Zhu, Kaiwen Zheng, Yuxuan Jiang, Zehua Chen.

Figure 1
Figure 1. Figure 1: Overall framework of our proposed AudioMoG, which illustrates the mechanism of AudioMoG and its degraded forms—Hierarchical Guidance exploits cumulative advantages from both methods for optimal performance, Parallel Guidance introduces complementary directions, and CFG or AG provides a single-directional guidance. in modern cross-modal audio generation systems (Liu et al., 2023; 2024a; Cheng et al., 2025).… view at source ↗
Figure 2
Figure 2. Figure 2: Illustration of guidance methods on the fractal-like 2D distribution from Karras et al. (2024a). (a) Ground truth distribution (orange class). (b) Unguided conditional sampling generates outliers. (c) Classifier-Free Guidance (w = 3) with a well-trained unconditional model struggles to remove outliers. (d) Autoguidance (w = 3) improves score estimation and removes outliers without reducing diversity. (e) P… view at source ↗
Figure 3
Figure 3. Figure 3: Performance comparison of HG and CFG-only across different NFEs. As shown, HG consistently outperforms CFG-only across all settings. objective metrics for quantitative comparison. The main results are summarized in [PITH_FULL_IMAGE:figures/full_fig_p008_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Case study comparing the spectrogram outputs of different guidance strategies (CFG, PG, HG) under various text prompts. HG consistently demonstrates superior harmonic structure modeling and clearer spectral patterns compared to PG and CFG. While PG shows moderate improvements, CFG often struggles to capture harmonics and yields blurrier, less structured results, particularly for complex prompts. These exam… view at source ↗
Figure 5
Figure 5. Figure 5: Impact of different guidance scales in PG. (a)(b)(c) stands for the relations of FAD, IS and CLAP with w1 and w2, respectively. The best configurations are denoted with stars. 3.0 3.5 4.0 4.5 5.0 w1 1.4 1.6 1.8 2.0 FAD 12.25 12.50 12.75 13.00 13.25 13.50 IS (a) FAD and IS w.r.t. w1 2.50 2.75 3.00 3.25 3.50 3.75 4.00 w2 1.40 1.45 1.50 1.55 1.60 1.65 FAD 13.0 13.2 13.4 13.6 IS (b) FAD and IS w.r.t. w2 1.0 1.… view at source ↗
Figure 6
Figure 6. Figure 6: Impact of different guidance scales in HG. (a) Sweep over w1 while keeping w2 and w3 unchanged. (b) Sweep over w2 while keeping w1 and w3 unchanged. (c) Sweep over w3 while keeping w1 and w2 unchanged. The best configurations are denoted with dashed lines. B IMPACT OF GUIDANCE SCALE To further demonstrate the effectiveness of AudioMoG, we investigate the impact of different guidance scales in PG and HG on … view at source ↗
Figure 7
Figure 7. Figure 7: Screenshot of our subjective evaluations. G.2 MODEL CONFIGURATIONS We use CLIP (Radford et al., 2021) embeddings to extract the visual features, and we leverage the same VAE as in T2A. The diffusion backbone shares the same core architecture as the T2A counterpart, being built upon the DiT within the LDM paradigm. Most hyperparameters remain the same with T2A, but we increased the conditional token dimensi… view at source ↗
Figure 8
Figure 8. Figure 8: More T2A results comparing the spectrograms of the generated samples with different guidance strategies (CFG, PG, and HG) under various text prompts. The third sample is shown with a different time interval than the one presented in the main paper, and they share the same text prompt. (a) HG demonstrates the best generation quality. (b) HG produces the most temporally aligned results [PITH_FULL_IMAGE:figu… view at source ↗
Figure 9
Figure 9. Figure 9: More V2A comparing the spectrograms of the generated samples with different guidance strategies (CFG and HG) and baselines (Diff-foley and Foley-crafter). • Diff-foley models (Luo et al., 2023): Apache-2.0 license • VTA-LDM models (Xu et al., 2024): Apache-2.0 license • FoleyCrafter models (Zhang et al., 2024): Apache-2.0 license • V2A-Mapper models (Wang et al., 2024a): Creative Commons BY-NC-ND 4.0 licen… view at source ↗
read the original abstract

The design of diffusion-based audio generation systems has been investigated from diverse perspectives, such as data space, network architecture, and conditioning techniques, while most of these innovations require model re-training. In sampling, classifier-free guidance (CFG) has been uniformly adopted to enhance generation quality by strengthening condition alignment. However, CFG often compromises diversity, resulting in suboptimal performance. Although the recent autoguidance (AG) method proposes another direction of guidance that maintains diversity, its direct application in audio generation has so far underperformed CFG. In this work, we introduce AudioMoG, an improved sampling method that enhances text-to-audio (T2A) and video-to-audio (V2A) generation quality without requiring extensive training resources. We start with an analysis of both CFG and AG, examining their respective advantages and limitations for guiding diffusion models. Building upon our insights, we introduce a mixture-of-guidance framework that integrates diverse guidance signals with their interaction terms (e.g., the unconditional bad version of the model) to maximize cumulative advantages. Experiments show that, given the same inference speed, our approach consistently outperforms single guidance in T2A generation across sampling steps, concurrently showing advantages in V2A, text-to-music, and image generation. Demo samples are available at: https://audiomog.github.io.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes AudioMoG, a mixture-of-guidance sampling method for diffusion-based audio generation. It analyzes classifier-free guidance (CFG) and autoguidance (AG), identifies their respective advantages and limitations, and introduces a framework that combines multiple guidance signals along with interaction terms (explicitly including the unconditional bad model) to improve condition alignment while preserving diversity. The central empirical claim is that, at fixed inference speed, AudioMoG consistently outperforms single-guidance baselines in text-to-audio (T2A) across sampling steps and shows advantages in video-to-audio (V2A), text-to-music, and image generation tasks, all without requiring model retraining.

Significance. If the reported gains are reproducible and general, the work would supply a practical, training-free enhancement to the sampling stage of audio diffusion models. By explicitly incorporating interaction terms rather than treating CFG and AG as mutually exclusive, the approach could help resolve the quality-diversity trade-off that currently limits CFG and the underperformance of direct AG in audio domains. The extension to V2A and cross-modal tasks further suggests broader applicability.

major comments (2)
  1. [Section 4] Section 4 (Experiments): the abstract and results claim consistent outperformance in T2A across sampling steps, yet no information is provided on the datasets employed, the precise evaluation metrics (e.g., CLAP, FAD, or subjective scores), the number of evaluation samples, or error bars from multiple random seeds. Without these details the statistical reliability of the cross-method comparison cannot be assessed and the central claim remains under-supported.
  2. [Section 3.2] Section 3.2 (Mixture-of-Guidance formulation): the mixing rule incorporates interaction terms such as the unconditional bad model, but the manuscript does not specify whether the guidance mixing coefficients are held constant across models and datasets or require per-model/per-dataset calibration. This directly affects the claim that the method avoids extensive tuning resources and raises the possibility that reported gains depend on favorable hyper-parameter choices rather than an intrinsic property of the framework.
minor comments (2)
  1. [Abstract] The term 'unconditional bad version of the model' appears in the abstract and method description without an immediate forward reference to its precise definition or computation; adding a brief parenthetical or citation to the relevant equation would improve readability.
  2. [Figures] Figure captions and axis labels in the experimental plots should explicitly state the guidance scales and mixing weights used for each curve to allow direct replication of the reported comparisons.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback, which helps improve the clarity and reproducibility of our work. We provide point-by-point responses to the major comments below and have incorporated revisions to address the concerns raised.

read point-by-point responses
  1. Referee: [Section 4] Section 4 (Experiments): the abstract and results claim consistent outperformance in T2A across sampling steps, yet no information is provided on the datasets employed, the precise evaluation metrics (e.g., CLAP, FAD, or subjective scores), the number of evaluation samples, or error bars from multiple random seeds. Without these details the statistical reliability of the cross-method comparison cannot be assessed and the central claim remains under-supported.

    Authors: We agree that the initial manuscript lacked sufficient experimental details to fully substantiate the claims. In the revised version, we will expand Section 4 to explicitly describe the datasets (e.g., AudioCaps for T2A evaluations), the precise metrics including CLAP for semantic alignment, FAD for perceptual quality, and any subjective scores used. We will also report the number of evaluation samples and include error bars derived from multiple random seeds to enable assessment of statistical reliability. These additions directly strengthen support for the consistent outperformance results across sampling steps. revision: yes

  2. Referee: [Section 3.2] Section 3.2 (Mixture-of-Guidance formulation): the mixing rule incorporates interaction terms such as the unconditional bad model, but the manuscript does not specify whether the guidance mixing coefficients are held constant across models and datasets or require per-model/per-dataset calibration. This directly affects the claim that the method avoids extensive tuning resources and raises the possibility that reported gains depend on favorable hyper-parameter choices rather than an intrinsic property of the framework.

    Authors: The mixing coefficients in AudioMoG are designed to be held constant and were applied uniformly without per-model or per-dataset recalibration in all reported experiments, including T2A, V2A, text-to-music, and image generation tasks. We will revise Section 3.2 to explicitly state this fixed-coefficient approach and note the specific values used, thereby reinforcing that the method requires no extensive tuning resources beyond the standard sampling procedure. This clarification addresses the concern that gains might stem from favorable hyper-parameter choices. revision: yes

Circularity Check

0 steps flagged

No significant circularity; AudioMoG framework is empirically validated rather than self-referential

full rationale

The paper begins with an analysis of existing CFG and AG techniques, identifies their respective strengths and weaknesses for audio generation, and then proposes a new mixture-of-guidance construction that combines signals and interaction terms. The claimed outperformance is presented as an experimental result at fixed inference speed across sampling steps, not as a mathematical identity or fitted quantity that reduces to the inputs by definition. No equations or steps equate the mixture result to a self-defined quantity, and no load-bearing premise relies on a self-citation chain or imported uniqueness theorem. The derivation remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 0 invented entities

The central claim rests on an empirical combination of existing guidance methods; limited free parameters likely include mixing coefficients chosen to optimize results.

free parameters (1)
  • guidance mixing coefficients
    Weights used to combine CFG, AG, and interaction signals; chosen or tuned to achieve reported performance gains.
axioms (1)
  • domain assumption Interaction terms from unconditional model versions provide beneficial guidance signals that can be additively combined.
    Invoked when describing the mixture framework that integrates diverse signals to maximize advantages.

pith-pipeline@v0.9.0 · 5779 in / 1205 out tokens · 43003 ms · 2026-05-18T13:05:13.908698+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

79 extracted references · 79 canonical work pages · 6 internal anchors

  1. [1]

    MusicLM: Generating Music From Text

    Andrea Agostinelli, Timo I Denk, Zal \'a n Borsos, Jesse Engel, Mauro Verzetti, Antoine Caillon, Qingqing Huang, Aren Jansen, Adam Roberts, Marco Tagliasacchi, et al. Musiclm: Generating music from text. arXiv preprint arXiv:2301.11325, 2023

  2. [2]

    The million song dataset

    Thierry Bertin-Mahieux, Daniel PW Ellis, Brian Whitman, and Paul Lamere. The million song dataset. In Ismir, volume 2, pp.\ 10, 2011

  3. [3]

    Classifier-free guidance is a predictor-corrector

    Arwen Bradley and Preetum Nakkiran. Classifier-free guidance is a predictor-corrector. arXiv preprint arXiv:2408.09000, 2024

  4. [4]

    Vggsound: A large-scale audio-visual dataset

    Honglie Chen, Weidi Xie, Andrea Vedaldi, and Andrew Zisserman. Vggsound: A large-scale audio-visual dataset. In ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp.\ 721--725. IEEE, 2020

  5. [5]

    Mmaudio: Taming multimodal joint training for high-quality video-to-audio synthesis

    Ho Kei Cheng, Masato Ishii, Akio Hayakawa, Takashi Shibuya, Alexander Schwing, and Yuki Mitsufuji. Mmaudio: Taming multimodal joint training for high-quality video-to-audio synthesis. In CVPR, pp.\ 28901--28911, 2025

  6. [6]

    What does guidance do? a fine-grained analysis in a simple setting

    Muthu Chidambaram, Khashayar Gatmiry, Sitan Chen, Holden Lee, and Jianfeng Lu. What does guidance do? a fine-grained analysis in a simple setting. arXiv preprint arXiv:2409.13074, 2024

  7. [7]

    Scaling instruction-finetuned language models

    Hyung Won Chung, Le Hou, Shayne Longpre, Barret Zoph, Yi Tay, William Fedus, Yunxuan Li, Xuezhi Wang, Mostafa Dehghani, Siddhartha Brahma, et al. Scaling instruction-finetuned language models. Journal of Machine Learning Research, 25 0 (70): 0 1--53, 2024

  8. [8]

    Syncfusion: Multimodal onset-synchronized video-to-audio foley synthesis

    Marco Comunit \`a , Riccardo F Gramaccioni, Emilian Postolache, Emanuele Rodol \`a , Danilo Comminiello, and Joshua D Reiss. Syncfusion: Multimodal onset-synchronized video-to-audio foley synthesis. In ICASSP, pp.\ 936--940. IEEE, 2024

  9. [9]

    Simple and controllable music generation

    Jade Copet, Felix Kreuk, Itai Gat, Tal Remez, David Kant, Gabriel Synnaeve, Yossi Adi, and Alexandre D \'e fossez. Simple and controllable music generation. Advances in Neural Information Processing Systems, 36: 0 47704--47720, 2023

  10. [10]

    Text-to-audio generation using instruction tuned LLM and latent diffusion model,

    Ghosal Deepanway, Majumder Navonil, Mehrish Ambuj, and Poria Soujanya. Text-to-audio generation using instruction-tuned llm and latent diffusion model. arXiv preprint arXiv:2304.13731, 2023

  11. [11]

    FMA: A Dataset For Music Analysis

    Micha \"e l Defferrard, Kirell Benzi, Pierre Vandergheynst, and Xavier Bresson. Fma: A dataset for music analysis. arXiv preprint arXiv:1612.01840, 2016

  12. [12]

    Clotho: An audio captioning dataset

    Konstantinos Drossos, Samuel Lipping, and Tuomas Virtanen. Clotho: An audio captioning dataset. In ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp.\ 736--740. IEEE, 2020

  13. [13]

    Conditional generation of audio from video via foley analogies

    Yuexi Du, Ziyang Chen, Justin Salamon, Bryan Russell, and Andrew Owens. Conditional generation of audio from video via foley analogies. In Conference on Computer Vision and Pattern Recognition 2023, 2023

  14. [14]

    Fast timing-conditioned latent audio diffusion

    Zach Evans, CJ Carr, Josiah Taylor, Scott H Hawley, and Jordi Pons. Fast timing-conditioned latent audio diffusion. In Forty-first International Conference on Machine Learning, 2024

  15. [15]

    Stable audio open

    Zach Evans, Julian D Parker, CJ Carr, Zack Zukowski, Josiah Taylor, and Jordi Pons. Stable audio open. In ICASSP 2025-2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp.\ 1--5. IEEE, 2025

  16. [16]

    Fsd50k: an open dataset of human-labeled sound events

    Eduardo Fonseca, Xavier Favory, Jordi Pons, Frederic Font, and Xavier Serra. Fsd50k: an open dataset of human-labeled sound events. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 30: 0 829--852, 2021

  17. [17]

    Riffusion-stable diffusion for real-time music generation

    Seth Forsgren and Hayk Martiros. Riffusion-stable diffusion for real-time music generation. URL https://riffusion. com, 2022

  18. [18]

    Audio set: An ontology and human-labeled dataset for audio events

    Jort F Gemmeke, Daniel PW Ellis, Dylan Freedman, Aren Jansen, Wade Lawrence, R Channing Moore, Manoj Plakal, and Marvin Ritter. Audio set: An ontology and human-labeled dataset for audio events. In 2017 IEEE international conference on acoustics, speech and signal processing (ICASSP), pp.\ 776--780. IEEE, 2017

  19. [19]

    Imagebind: One embedding space to bind them all

    Rohit Girdhar, Alaaeldin El-Nouby, Zhuang Liu, Mannat Singh, Kalyan Vasudev Alwala, Armand Joulin, and Ishan Misra. Imagebind: One embedding space to bind them all. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp.\ 15180--15190, 2023

  20. [20]

    Gemmeke, Aren Jansen, R

    Shawn Hershey, Sourish Chaudhuri, Daniel PW Ellis, Jort F. Gemmeke, Aren Jansen, R. Channing Moore, Manoj Plakal, Devin Platt, Rif A. Saurous, Bryan Seybold, Malcolm Slaney, Ron J. Weiss, and Kevin Wilson. Cnn architectures for large-scale audio classification. In Proc. ICASSP, pp.\ 131--135, 2017

  21. [21]

    Gans trained by a two time-scale update rule converge to a local nash equilibrium

    Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. Gans trained by a two time-scale update rule converge to a local nash equilibrium. In Proc. NeurIPS, pp.\ 6626--6637, 2017

  22. [22]

    Classifier-Free Diffusion Guidance

    Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance. arXiv preprint arXiv:2207.12598, 2022

  23. [23]

    Improving sample quality of diffusion models using self-attention guidance

    Susung Hong, Gyuseong Lee, Wooseok Jang, and Seungryong Kim. Improving sample quality of diffusion models using self-attention guidance. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp.\ 7462--7471, 2023

  24. [24]

    Make-An-Audio 2: Temporal-enhanced text-to- audio generation,

    Jiawei Huang, Yi Ren, Rongjie Huang, Dongchao Yang, Zhenhui Ye, Chen Zhang, Jinglin Liu, Xiang Yin, Zejun Ma, and Zhou Zhao. Make-an-audio 2: Temporal-enhanced text-to-audio generation. arXiv preprint arXiv:2305.18474, 2023 a

  25. [25]

    Masked autoencoders that listen

    Po-Yao Huang, Hu Xu, Juncheng Li, Alexei Baevski, Michael Auli, Wojciech Galuba, Florian Metze, and Christoph Feichtenhofer. Masked autoencoders that listen. Advances in Neural Information Processing Systems, 35: 0 28708--28720, 2022

  26. [26]

    Make-an-audio: Text-to-audio generation with prompt-enhanced diffusion models

    Rongjie Huang, Jiawei Huang, Dongchao Yang, Yi Ren, Luping Liu, Mingze Li, Zhenhui Ye, Jinglin Liu, Xiang Yin, and Zhou Zhao. Make-an-audio: Text-to-audio generation with prompt-enhanced diffusion models. In International Conference on Machine Learning, pp.\ 13916--13932. PMLR, 2023 b

  27. [27]

    Tan- goflux: Super fast and faithful text to audio generation with flow matching and clap-ranked preference optimization,

    Chia-Yu Hung, Navonil Majumder, Zhifeng Kong, Ambuj Mehrish, Rafael Valle, Bryan Catanzaro, and Soujanya Poria. Tangoflux: Super fast and faithful text to audio generation with flow matching and clap-ranked preference optimization. arXiv preprint arXiv:2412.21037, 2024

  28. [28]

    Spatiotemporal skip guidance for enhanced video diffusion sampling

    Junha Hyung, Kinam Kim, Susung Hong, Min - Jung Kim, and Jaegul Choo. Spatiotemporal skip guidance for enhanced video diffusion sampling. In CVPR, 2025

  29. [29]

    Spg: Improving motion diffusion by smooth perturbation guidance

    Boseong Jeon. Spg: Improving motion diffusion by smooth perturbation guidance. arXiv preprint arXiv:2503.02577, 2025

  30. [30]

    Read, watch and scream! sound generation from text and video

    Yujin Jeong, Yunji Kim, Sanghyuk Chun, and Jiyoung Lee. Read, watch and scream! sound generation from text and video. arXiv preprint arXiv:2407.05551, 2024

  31. [31]

    Freeaudio: Training-free timing planning for controllable long-form text-to-audio generation

    Yuxuan Jiang, Zehua Chen, Zeqian Ju, Chang Li, Weibei Dou, and Jun Zhu. Freeaudio: Training-free timing planning for controllable long-form text-to-audio generation. arXiv preprint arXiv:2507.08557, 2025

  32. [32]

    Guiding a diffusion model with a bad version of itself

    Tero Karras, Miika Aittala, Tuomas Kynk \"a \"a nniemi, Jaakko Lehtinen, Timo Aila, and Samuli Laine. Guiding a diffusion model with a bad version of itself. Advances in Neural Information Processing Systems, 37: 0 52996--53021, 2024 a

  33. [33]

    Analyzing and improving the training dynamics of diffusion models

    Tero Karras, Miika Aittala, Jaakko Lehtinen, Janne Hellsten, Timo Aila, and Samuli Laine. Analyzing and improving the training dynamics of diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.\ 24174--24184, 2024 b

  34. [34]

    Autolora: Autoguidance meets low-rank adaptation for diffusion models

    Artur Kasymov, Marcin Sendera, Michal Stypulkowski, Maciej Zieba, and Przemysław Spurek. Autolora: Autoguidance meets low-rank adaptation for diffusion models. ArXiv, abs/2410.03941, 2024. URL https://api.semanticscholar.org/CorpusID:273186941

  35. [35]

    Sim, and Paris Smaragdis

    Kevin Kilgour, Robin Clark, Kyu J. Sim, and Paris Smaragdis. Fréchet audio distance: A metric for evaluating music enhancement algorithms. In Proc. Interspeech, pp.\ 2350--2354, 2019

  36. [36]

    Audiocaps: Generating captions for audios in the wild

    Chris Dongjoo Kim, Byeongchang Kim, Hyunmin Lee, and Gunhee Kim. Audiocaps: Generating captions for audios in the wild. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp.\ 119--132, 2019

  37. [37]

    Auto-encoding variational bayes, 2013

    Diederik P Kingma, Max Welling, et al. Auto-encoding variational bayes, 2013

  38. [38]

    Plumbley

    Qiuqiang Kong, Yin Cao, Turab Iqbal, Yuxuan Wang, Wenwu Wang, and Mark D. Plumbley. Panns: Large-scale pretrained audio neural networks for audio pattern recognition. In IEEE/ACM Transactions on Audio, Speech, and Language Processing, volume 28, pp.\ 2880--2894, 2020

  39. [39]

    Improving text-to-audio models with synthetic captions

    Zhifeng Kong, Sang-gil Lee, Deepanway Ghosal, Navonil Majumder, Ambuj Mehrish, Rafael Valle, Soujanya Poria, and Bryan Catanzaro. Improving text-to-audio models with synthetic captions. arXiv preprint arXiv:2406.15487, 2024

  40. [40]

    Audiogen: Textually guided audio generation.arXiv preprint arXiv:2209.15352,

    Felix Kreuk, Gabriel Synnaeve, Adam Polyak, Uriel Singer, Alexandre D \'e fossez, Jade Copet, Devi Parikh, Yaniv Taigman, and Yossi Adi. Audiogen: Textually guided audio generation. arXiv preprint arXiv:2209.15352, 2022

  41. [41]

    Solomon Kullback and Richard A. Leibler. On information and sufficiency. The Annals of Mathematical Statistics, 22 0 (1): 0 79--86, 1951

  42. [42]

    Efficient neural music generation

    Max WY Lam, Qiao Tian, Tang Li, Zongyu Yin, Siyuan Feng, Ming Tu, Yuliang Ji, Rui Xia, Mingbo Ma, Xuchen Song, et al. Efficient neural music generation. Advances in Neural Information Processing Systems, 36: 0 17450--17463, 2023

  43. [43]

    Evaluation of algorithms using games: The case of music tagging

    Edith Law, Kris West, Michael I Mandel, Mert Bay, and J Stephen Downie. Evaluation of algorithms using games: The case of music tagging. In ISMIR, pp.\ 387--392. Citeseer, 2009

  44. [44]

    Etta: Elucidating the design space of text-to-audio models

    Sang-gil Lee, Zhifeng Kong, Arushi Goel, Sungwon Kim, Rafael Valle, and Bryan Catanzaro. Etta: Elucidating the design space of text-to-audio models. In ICML, 2025

  45. [45]

    Quality-aware masked diffusion transformer for enhanced music generation

    Chang Li, Ruoyu Wang, Lijuan Liu, Jun Du, Yixuan Sun, Zilu Guo, Zhenrong Zhang, and Yuan Jiang. Quality-aware masked diffusion transformer for enhanced music generation. arXiv e-prints, pp.\ arXiv--2405, 2024 a

  46. [46]

    Jen-1: Text-guided universal music generation with omnidirectional diffusion models

    Peike Patrick Li, Boyu Chen, Yao Yao, Yikai Wang, Allen Wang, and Alex Wang. Jen-1: Text-guided universal music generation with omnidirectional diffusion models. In 2024 IEEE Conference on Artificial Intelligence (CAI), pp.\ 762--769. IEEE, 2024 b

  47. [47]

    Self-guidance: Boosting flow and diffusion generation on their own

    Tiancheng Li, Weijian Luo, Zhiyang Chen, Liyuan Ma, and Guo-Jun Qi. Self-guidance: Boosting flow and diffusion generation on their own. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2025

  48. [48]

    Audioldm: Text-to-audio generation with latent diffusion models.arXiv preprint arXiv:2301.12503,

    Haohe Liu, Zehua Chen, Zejia Yuan, et al. Audioldm: Text-to-audio generation with latent diffusion models. arXiv preprint arXiv:2301.12503, 2023

  49. [49]

    Audioldm 2: Learning holistic audio generation with self-supervised pretraining

    Haohe Liu, Yi Yuan, Xubo Liu, Xinhao Mei, Qiuqiang Kong, Qiao Tian, Yuping Wang, Wenwu Wang, Yuxuan Wang, and Mark D Plumbley. Audioldm 2: Learning holistic audio generation with self-supervised pretraining. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2024 a

  50. [50]

    Tell what you hear from what you see - video to audio generation through text

    Xiulong Liu, Kun Su, and Eli Shlizerman. Tell what you hear from what you see - video to audio generation through text. In NeurIPS, 2024 b . URL https://openreview.net/forum?id=kr7eN85mIT

  51. [51]

    Decoupled Weight Decay Regularization

    Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017

  52. [52]

    DPM-Solver++: Fast Solver for Guided Sampling of Diffusion Probabilistic Models

    Cheng Lu, Yuhao Zhou, Fan Bao, Jianfei Chen, Chongxuan Li, and Jun Zhu. Dpm-solver++: Fast solver for guided sampling of diffusion probabilistic models. arXiv preprint arXiv:2211.01095, 2022

  53. [53]

    Diff-foley: Synchronized video-to-audio synthesis with latent diffusion models

    Simian Luo, Chuanhao Yan, Chenxu Hu, and Hang Zhao. Diff-foley: Synchronized video-to-audio synthesis with latent diffusion models. Advances in Neural Information Processing Systems, 36: 0 48855--48876, 2023

  54. [54]

    Tango 2: Aligning diffusion-based text-to-audio generations through direct preference optimization

    Navonil Majumder, Chia-Yu Hung, Deepanway Ghosal, Wei-Ning Hsu, Rada Mihalcea, and Soujanya Poria. Tango 2: Aligning diffusion-based text-to-audio generations through direct preference optimization. In Proceedings of the 32nd ACM International Conference on Multimedia, pp.\ 564--572, 2024

  55. [55]

    Foleygen: Visually-guided audio generation

    Xinhao Mei, Varun Nagaraja, Gael Le Lan, Zhaoheng Ni, Ernie Chang, Yangyang Shi, and Vikas Chandra. Foleygen: Visually-guided audio generation. In 2024 IEEE 34th International Workshop on Machine Learning for Signal Processing (MLSP), pp.\ 1--6. IEEE, 2024

  56. [56]

    Mubert-Inc. Mubert. URL https://mubert.com/, https://github.com/MubertAI/ Mubert-Text-to-Music, 2022

  57. [57]

    Scalable diffusion models with transformers

    William Peebles and Saining Xie. Scalable diffusion models with transformers. In Proceedings of the IEEE/CVF international conference on computer vision, pp.\ 4195--4205, 2023

  58. [58]

    Unconditional priors matter! improving conditional generation of fine-tuned diffusion models

    Prin Phunyaphibarn, Phillip Y Lee, Jaihoon Kim, and Minhyuk Sung. Unconditional priors matter! improving conditional generation of fine-tuned diffusion models. arXiv preprint arXiv:2503.20240, 2025

  59. [59]

    Learning Transferable Visual Models From Natural Language Supervision

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable visual models from natural language supervision, 2021. URL https://arxiv.org/abs/2103.00020

  60. [60]

    Direct preference optimization: Your language model is secretly a reward model

    Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D Manning, Stefano Ermon, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model. Advances in Neural Information Processing Systems, 36: 0 53728--53741, 2023

  61. [61]

    High-resolution image synthesis with latent diffusion models

    Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Bj \"o rn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp.\ 10684--10695, 2022

  62. [62]

    arXiv preprint arXiv:2407.02687 , year =

    Seyedmorteza Sadat, Manuel Kansy, Otmar Hilliges, and Romann M Weber. No training, no problem: Rethinking classifier-free guidance for diffusion models. arXiv preprint arXiv:2407.02687, 2024

  63. [63]

    Improved techniques for training gans

    Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. Improved techniques for training gans. In Proc. NeurIPS, pp.\ 2234--2242, 2016

  64. [64]

    Mo \^ usai: Text-to-music generation with long-context latent diffusion

    Flavio Schneider, Ojasv Kamal, Zhijing Jin, and Bernhard Sch \"o lkopf. Mo \^ usai: Text-to-music generation with long-context latent diffusion. arXiv preprint arXiv:2301.11757, 2023

  65. [65]

    I hear your true colors: Image guided audio generation

    Roy Sheffer and Yossi Adi. I hear your true colors: Image guided audio generation. In ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp.\ 1--5. IEEE, 2023

  66. [66]

    From vision to audio and beyond: A unified model for audio-visual representation and generation

    Kun Su, Xiulong Liu, and Eli Shlizerman. From vision to audio and beyond: A unified model for audio-visual representation and generation. arXiv preprint arXiv:2409.19132, 2024

  67. [67]

    Neural discrete representation learning

    Aaron Van Den Oord, Oriol Vinyals, et al. Neural discrete representation learning. Advances in neural information processing systems, 30, 2017

  68. [68]

    V2a-mapper: A lightweight solution for vision-to-audio generation by connecting foundation models

    Heng Wang, Jianbo Ma, Santiago Pascual, Richard Cartwright, and Weidong Cai. V2a-mapper: A lightweight solution for vision-to-audio generation by connecting foundation models. In AAAI, volume 38, pp.\ 15492--15501, 2024 a

  69. [69]

    Tiva: Time-aligned video-to-audio generation

    Xihua Wang, Yuyue Wang, Yihan Wu, Ruihua Song, Xu Tan, Zehua Chen, Hongteng Xu, and Guodong Sui. Tiva: Time-aligned video-to-audio generation. In ACM Multimedia, 2024 b

  70. [70]

    Large-scale contrastive language-audio pretraining with feature fusion and keyword-to-caption augmentation

    Yusong Wu, Ke Chen, Tianyu Zhang, Yuchen Hui, Taylor Berg-Kirkpatrick, and Shlomo Dubnov. Large-scale contrastive language-audio pretraining with feature fusion and keyword-to-caption augmentation. In ICASSP, pp.\ 1--5. IEEE, 2023

  71. [71]

    Sonicvisionlm: Playing sound with vision language models

    Zhifeng Xie, Shengye Yu, Qile He, and Mengtian Li. Sonicvisionlm: Playing sound with vision language models. In CVPR, pp.\ 26866--26875, 2024

  72. [72]

    Video-to-audio generation with hidden alignment

    Manjie Xu, Chenxing Li, Yong Ren, Rilin Chen, Yu Gu, Wei Liang, and Dong Yu. Video-to-audio generation with hidden alignment. arXiv preprint arXiv:2407.07464, 2024

  73. [73]

    Diffsound: Discrete diffusion model for text-to-sound generation

    Dongchao Yang, Jianwei Yu, Helin Wang, Wen Wang, Chao Weng, Yuexian Zou, and Dong Yu. Diffsound: Discrete diffusion model for text-to-sound generation. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 31: 0 1720--1733, 2023

  74. [74]

    Foley- crafter: Bring silent videos to life with lifelike and synchro- nized sounds,

    Yiming Zhang, Yicheng Gu, Yanhong Zeng, Zhening Xing, Yuancheng Wang, Zhizheng Wu, and Kai Chen. Foleycrafter: Bring silent videos to life with lifelike and synchronized sounds. arXiv preprint arXiv:2407.01494, 2024

  75. [75]

    Domain guidance: A simple transfer approach for a pre-trained diffusion model

    Jincheng Zhong, Xiangcheng Zhang, Jianmin Wang, and Mingsheng Long. Domain guidance: A simple transfer approach for a pre-trained diffusion model. arXiv preprint arXiv:2504.01521, 2025

  76. [76]

    write newline

    " write newline "" before.all 'output.state := FUNCTION n.dashify 't := "" t empty not t #1 #1 substring "-" = t #1 #2 substring "--" = not "--" * t #2 global.max substring 't := t #1 #1 substring "-" = "-" * t #2 global.max substring 't := while if t #1 #1 substring * t #2 global.max substring 't := if while FUNCTION format.date year duplicate empty "emp...

  77. [77]

    @esa (Ref

    \@ifxundefined[1] #1\@undefined \@firstoftwo \@secondoftwo \@ifnum[1] #1 \@firstoftwo \@secondoftwo \@ifx[1] #1 \@firstoftwo \@secondoftwo [2] @ #1 \@temptokena #2 #1 @ \@temptokena \@ifclassloaded agu2001 natbib The agu2001 class already includes natbib coding, so you should not add it explicitly Type <Return> for now, but then later remove the command n...

  78. [78]

    \@lbibitem[] @bibitem@first@sw\@secondoftwo \@lbibitem[#1]#2 \@extra@b@citeb \@ifundefined br@#2\@extra@b@citeb \@namedef br@#2 \@nameuse br@#2\@extra@b@citeb \@ifundefined b@#2\@extra@b@citeb @num @parse #2 @tmp #1 NAT@b@open@#2 NAT@b@shut@#2 \@ifnum @merge>\@ne @bibitem@first@sw \@firstoftwo \@ifundefined NAT@b*@#2 \@firstoftwo @num @NAT@ctr \@secondoft...

  79. [79]

    Wڵk䤦455e ­ZJVիu-_h `Y(

    @open @close @open @close and [1] URL: #1 \@ifundefined chapter * \@mkboth \@ifxundefined @sectionbib * \@mkboth * \@mkboth\@gobbletwo \@ifclassloaded amsart * \@ifclassloaded amsbook * \@ifxundefined @heading @heading NAT@ctr thebibliography [1] @ \@biblabel @NAT@ctr \@bibsetup #1 @NAT@ctr @ @openbib .11em \@plus.33em \@minus.07em 4000 4000 `\.\@m @bibit...