DreamAudio: Customized Text-to-Audio Generation with Diffusion Models

Haohe Liu; Mark D. Plumbley; Wenwu Wang; Xiyuan Kang; Xubo Liu; Yi Yuan; Yuxuan Wang; Zhuo Chen

arxiv: 2509.06027 · v3 · submitted 2025-09-07 · 💻 cs.SD · cs.AI· eess.AS

DreamAudio: Customized Text-to-Audio Generation with Diffusion Models

Yi Yuan , Xubo Liu , Haohe Liu , Xiyuan Kang , Zhuo Chen , Yuxuan Wang , Mark D. Plumbley , Wenwu Wang This is my paper

Pith reviewed 2026-05-18 18:22 UTC · model grok-4.3

classification 💻 cs.SD cs.AIeess.AS

keywords customized text-to-audiodiffusion modelspersonalized audio eventsreference-based audio generationCTTAtext-to-audio generationaudio customization

0 comments

The pith

DreamAudio generates new audio clips that include specific events from a few user reference samples while following text prompts.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Existing text-to-audio models align sounds with descriptions but offer little control over fine acoustic details of particular events. DreamAudio introduces a diffusion-based framework that extracts auditory information from a small number of reference audio clips containing personalized events. The system then recombines this information to create fresh audio outputs that incorporate the referenced events. Experiments show the outputs stay consistent with the custom features and match the text prompts, while also performing comparably on ordinary text-to-audio tasks. The work supplies new training and test datasets plus a human-curated benchmark drawn from real customized-generation cases.

Core claim

Given a few reference audio samples containing personalized audio events, the DreamAudio system can generate new audio samples that include these specific events and are aligned well with the input text prompts. The framework identifies auditory information from the references and recombines it during diffusion-based generation to produce customized results.

What carries the argument

A framework that enables the diffusion model to identify and apply auditory information from user-provided reference audio concepts.

Load-bearing premise

Reference audio samples contain clean, identifiable, and transferable information about the personalized events that the model can extract and recombine without interference from noise or unrelated acoustic traits.

What would settle it

Generate outputs from held-out reference sets and test whether listeners can reliably detect the specific referenced acoustic events in the results at rates well above chance, or whether text-prompt alignment drops below that of standard diffusion text-to-audio models.

Figures

Figures reproduced from arXiv: 2509.06027 by Haohe Liu, Mark D. Plumbley, Wenwu Wang, Xiyuan Kang, Xubo Liu, Yi Yuan, Yuxuan Wang, Zhuo Chen.

**Figure 1.** Figure 1: A demo of DreamAudio for audio generation with customized content of “monster fighting” and “Minion talking”. The system takes both the text prompt and user-provided audio-caption pairs as the reference concepts, and generate audio content consistent with the description ”Monster is fighting with a Minion”. audio features, these models have shown strong capabilities in generating audio samples of high qual… view at source ↗

**Figure 2.** Figure 2: The inference pipeline of the DreamAudio. The input prompt and reference concept are encoded in two paralleled paths through the Flan-T5 Encoder and the reference audio feature is encoded by the VAE Encoder. Along with the noised data zλ, four inputs are forwarded to the generator with MRC structure to generate the denoised data, followed by the VAE decoder and vocoder to reconstruct the final output. The … view at source ↗

**Figure 3.** Figure 3: The details of the MRC UNet network, where the model takes the reference feature [PITH_FULL_IMAGE:figures/full_fig_p004_3.png] view at source ↗

**Figure 4.** Figure 4: The generation pipeline of the customized datasets, with the Customized-Concatenation on the left and Customized-Overlay on the right. All the clips [PITH_FULL_IMAGE:figures/full_fig_p006_4.png] view at source ↗

**Figure 5.** Figure 5: The details of the MRC UNet network for reference length fine-tuning. [PITH_FULL_IMAGE:figures/full_fig_p009_5.png] view at source ↗

read the original abstract

With the development of large-scale diffusion-based and language-modeling-based generative models, impressive progress has been achieved in text-to-audio generation. Despite producing high-quality outputs, existing text-to-audio models mainly aim to generate semantically aligned sound and fall short of controlling fine-grained acoustic characteristics of specific sounds. As a result, users who need specific sound content may find it difficult to generate the desired audio clips. In this paper, we present DreamAudio for customized text-to-audio generation (CTTA). Specifically, we introduce a new framework that is designed to enable the model to identify auditory information from user-provided reference concepts for audio generation. Given a few reference audio samples containing personalized audio events, our system can generate new audio samples that include these specific events. In addition, two types of datasets are developed for training and testing the proposed systems. The experiments show that DreamAudio generates audio samples that are highly consistent with the customized audio features and aligned well with the input text prompts. Furthermore, DreamAudio offers comparable performance in general text-to-audio tasks. We also provide a human-involved dataset containing audio events from real-world CTTA cases as the benchmark for customized generation tasks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

DreamAudio adds reference conditioning to diffusion text-to-audio and ships new datasets, but the results leave open whether the model isolates events or just copies broader acoustic traits from the clips.

read the letter

DreamAudio sets up a diffusion framework that takes a few reference audio clips containing specific events and tries to generate new samples that include those events while following a text prompt. The main addition is a dedicated setup for this customized task plus two datasets for training and testing, along with a human-collected real-world benchmark of audio events from actual use cases. The reported experiments indicate the outputs stay consistent with the reference features and align with the prompts, and the model does not lose much ground on standard text-to-audio performance.

Referee Report

3 major / 2 minor

Summary. The manuscript presents DreamAudio, a diffusion-based framework for customized text-to-audio generation (CTTA). Given a few reference audio samples containing personalized audio events, the system generates new samples that incorporate these specific events while remaining aligned with input text prompts. The authors introduce two new datasets for training and testing, report experimental results indicating high consistency with the customized features and good prompt alignment, show comparable performance to standard text-to-audio tasks, and release a human-involved benchmark dataset drawn from real-world CTTA scenarios.

Significance. If the experimental claims hold under detailed scrutiny, the work would represent a meaningful step toward fine-grained, reference-driven control in audio generation, addressing the limitation of existing text-to-audio models that primarily achieve semantic alignment rather than event-specific customization. The accompanying datasets and benchmark would also supply useful resources for future research on personalized audio synthesis.

major comments (3)

[Methods] Methods / conditioning subsection: The reference encoder (typically CLAP or WavLM embeddings) is fed into standard cross-attention or FiLM layers without an explicit event-disentanglement stage. This leaves open the possibility that the UNet satisfies the training objective by copying global acoustic statistics (timbre, background, recording artifacts) rather than isolating the target auditory event, directly undermining the central claim that a few references supply cleanly transferable, event-specific features.
[Experiments] Experiments / results section: The reported positive outcomes on consistency and alignment lack accompanying quantitative metrics, ablation tables (e.g., performance on noisy vs. clean references or single- vs. multi-source clips), and error analysis. Without these, it is impossible to verify robustness against the weakest assumption that reference clips contain clean, identifiable events.
[Datasets] Dataset construction: The two new datasets are introduced for CTTA training and testing, yet the selection and annotation protocol for ensuring that reference clips isolate the intended personalized events (rather than composite or noisy scenes) is not specified. This detail is load-bearing for the benchmark's validity and for the claim of real-world applicability.

minor comments (2)

[Abstract] Abstract: Adding one or two key quantitative scores (e.g., consistency or alignment metrics) would make the summary of experimental outcomes more informative.
Notation and terminology: The phrases 'customized audio features' and 'personalized audio events' are used interchangeably; a single consistent term would improve clarity.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback, which highlights important aspects for improving the clarity and robustness of our work on DreamAudio. We address each major comment below and outline the revisions we will incorporate.

read point-by-point responses

Referee: [Methods] Methods / conditioning subsection: The reference encoder (typically CLAP or WavLM embeddings) is fed into standard cross-attention or FiLM layers without an explicit event-disentanglement stage. This leaves open the possibility that the UNet satisfies the training objective by copying global acoustic statistics (timbre, background, recording artifacts) rather than isolating the target auditory event, directly undermining the central claim that a few references supply cleanly transferable, event-specific features.

Authors: We acknowledge the validity of this concern: without an explicit disentanglement module, it is possible in principle for the model to rely on global acoustic properties. Our design instead uses the joint training objective (reference consistency plus text alignment) and the iterative nature of the diffusion process to encourage focus on the target event. The cross-attention layers allow the UNet to attend selectively to reference features that match the prompt semantics. To strengthen this argument, we will revise the methods section with additional discussion of the implicit disentanglement mechanism and include a new ablation comparing conditioning with and without text prompts. revision: yes
Referee: [Experiments] Experiments / results section: The reported positive outcomes on consistency and alignment lack accompanying quantitative metrics, ablation tables (e.g., performance on noisy vs. clean references or single- vs. multi-source clips), and error analysis. Without these, it is impossible to verify robustness against the weakest assumption that reference clips contain clean, identifiable events.

Authors: We agree that more granular quantitative support and ablations would strengthen the experimental claims. The current results include consistency and alignment scores, yet we did not provide the requested breakdowns or error analysis. In the revised manuscript we will add ablation tables for noisy versus clean references and single- versus multi-source clips, together with a quantitative error analysis section that reports failure cases and robustness metrics. revision: yes
Referee: [Datasets] Dataset construction: The two new datasets are introduced for CTTA training and testing, yet the selection and annotation protocol for ensuring that reference clips isolate the intended personalized events (rather than composite or noisy scenes) is not specified. This detail is load-bearing for the benchmark's validity and for the claim of real-world applicability.

Authors: We recognize that a clear description of the annotation protocol is necessary for reproducibility and to support claims of real-world applicability. The reference clips were chosen and verified by human annotators to contain isolated target events; however, this process was only summarized briefly. We will expand the datasets section to detail the selection criteria, annotation guidelines, quality-control steps, and inter-annotator agreement used to ensure references isolate the intended personalized events. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical framework with new datasets and experiments

full rationale

The paper introduces DreamAudio as a new conditioning framework for reference-based audio event transfer in diffusion models, trained and evaluated on two newly developed datasets plus a human-involved real-world benchmark. No equations, derivations, or self-citations are presented that reduce claimed performance or event isolation to quantities fitted from the same data by construction. The central claims rest on experimental consistency with text prompts and reference features rather than any tautological renaming or load-bearing self-reference.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The work rests on standard diffusion-model conditioning assumptions plus the creation of new datasets; no explicit free parameters or invented entities are described in the abstract.

axioms (1)

domain assumption Diffusion models can be effectively conditioned on both text prompts and reference audio features for controllable generation.
Invoked when the framework is said to identify auditory information from references and align with text.

pith-pipeline@v0.9.0 · 5763 in / 1124 out tokens · 39156 ms · 2026-05-18T18:22:55.284053+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

81 extracted references · 81 canonical work pages · 6 internal anchors

[1]

A comprehensive survey of AI-generated content: A history of generative AI from GAN to ChatGPT,

Y . Cao, S. Li, Y . Liu, Z. Yan, Y . Dai, P. S. Yu, and L. Sun, “A comprehensive survey of AI-generated content: A history of generative AI from GAN to ChatGPT,”arXiv:2303.04226, 2023

work page arXiv 2023
[2]

AudioLDM: Text-to-Audio generation with latent diffusion models,

H. Liu, Z. Chen, Y . Yuan, X. Mei, X. Liu, D. Mandic, W. Wang, and M. D. Plumbley, “AudioLDM: Text-to-Audio generation with latent diffusion models,” inProceedings of the International Conference on Machine Learning, 2023, pp. 21 450–21 474

work page 2023
[3]

Sound to visual scene generation by audio-to-visual latent alignment,

K. Sung-Bin, A. Senocak, H. Ha, A. Owens, and T.-H. Oh, “Sound to visual scene generation by audio-to-visual latent alignment,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 6430–6440

work page 2023
[4]

I hear your true colors: Image guided audio gen- eration,

R. Sheffer and Y . Adi, “I hear your true colors: Image guided audio gen- eration,” inProceedings of IEEE International Conference on Acoustics, Speech and Signal Processing, 2023

work page 2023
[5]

FoleyGen: Visually-guided audio generation,

X. Mei, V . Nagaraja, G. Le Lan, Z. Ni, E. Chang, Y . Shi, and V . Chandra, “FoleyGen: Visually-guided audio generation,” inProceedings of IEEE International Workshop on Machine Learning for Signal Processing, 2024

work page 2024
[6]

Taming visually guided sound generation,

V . Iashin and E. Rahtu, “Taming visually guided sound generation,” in Proceedings of British Machine Vision Conference, 2021

work page 2021
[7]

AudioGen: Textually guided audio generation,

F. Kreuk, G. Synnaeve, A. Polyak, U. Singer, A. D ´efossez, J. Copet, D. Parikh, Y . Taigman, and Y . Adi, “AudioGen: Textually guided audio generation,”International Conference on Learning Representations, 2022

work page 2022
[8]

Riffusion: Stable diffusion for real-time music generation,

S. Forsgren and H. Martiros, “Riffusion: Stable diffusion for real-time music generation,” 2022.[Online]. Available: https://riffusion.com/about

work page 2022
[9]

WavJourney: Compositional audio creation with large language models,

X. Liu, Z. Zhu, H. Liu, Y . Yuan, M. Cui, Q. Huang, J. Liang, Y . Cao, Q. Kong, M. D. Plumbleyet al., “WavJourney: Compositional audio creation with large language models,”arXiv:2307.14335, 2023. 11

work page arXiv 2023
[10]

Denoising diffusion implicit models,

J. Song, C. Meng, and S. Ermon, “Denoising diffusion implicit models,” inProceedings of International Conference on Learning Representa- tions, 2020

work page 2020
[11]

Denoising diffusion probabilistic models,

J. Ho, A. Jain, and P. Abbeel, “Denoising diffusion probabilistic models,” inAdvances in Neural Information Processing Systems, vol. 33, 2020, pp. 6840–6851

work page 2020
[12]

AudioSet: An ontology and human-labeled dataset for audio events,

J. F. Gemmeke, D. P. W. Ellis, D. Freedman, A. Jansen, W. Lawrence, R. C. Moore, M. Plakal, and M. Ritter, “AudioSet: An ontology and human-labeled dataset for audio events,” inProceedings of IEEE International Conference on Acoustics, Speech and Signal Processing, 2017, pp. 776–780

work page 2017
[13]

WavCaps: A ChatGPT-assisted weakly-labelled audio captioning dataset for audio-language multimodal research,

X. Mei, C. Meng, H. Liu, Q. Kong, T. Ko, C. Zhao, M. D. Plumbley, Y . Zou, and W. Wang, “WavCaps: A ChatGPT-assisted weakly-labelled audio captioning dataset for audio-language multimodal research,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 32, pp. 3339–3354, 2024

work page 2024
[14]

A large-scale dataset for audio- language representation learning,

L. Sun, X. Xu, M. Wu, and W. Xie, “A large-scale dataset for audio- language representation learning,”arXiv:2309.11500, 2023

work page arXiv 2023
[15]

Sound-VECaps: Improving audio generation with visual enhanced captions,

Y . Yuan, D. Jia, X. Zhuang, Y . Chen, Z. Liu, Z. Chen, Y . Wang, Y . Wang, X. Liu, X. Kanget al., “Sound-VECaps: Improving audio generation with visual enhanced captions,” inAudio Imagination: NeurIPS 2024 Workshop AI-Driven Speech, Music, and Sound Generation

work page 2024
[16]

AudioGen: textually guided audio generation,

F. Kreuk, G. Synnaeve, A. Polyak, U. Singer, A. D ´efossez, J. Copet, D. Parikh, Y . Taigman, and Y . Adi, “AudioGen: textually guided audio generation,” inProceedings of International Conference on Learning Representations, 2023

work page 2023
[17]

Diffsound: Discrete diffusion model for text-to-sound generation,

D. Yang, J. Yu, H. Wang, W. Wang, C. Weng, Y . Zou, and D. Yu, “Diffsound: Discrete diffusion model for text-to-sound generation,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2023

work page 2023
[19]

Make-An-Audio 2: Temporal-enhanced text-to- audio generation,

J. Huang, Y . Ren, R. Huang, D. Yang, Z. Ye, C. Zhang, J. Liu, X. Yin, Z. Ma, and Z. Zhao, “Make-An-Audio 2: Temporal-enhanced text-to- audio generation,”arXiv:2305.18474, 2023

work page arXiv 2023
[20]

Retrieval-augmented text-to-audio generation,

Y . Yuan, H. Liu, X. Liu, Q. Huang, M. D. Plumbley, and W. Wang, “Retrieval-augmented text-to-audio generation,” inProceedings of IEEE International Conference on Acoustics, Speech and Signal, 2024, pp. 581–585

work page 2024
[21]

Audioldm 2: Learning holistic audio generation with self-supervised pretraining,

H. Liu, Y . Yuan, X. Liu, X. Mei, Q. Kong, Q. Tian, Y . Wang, W. Wang, Y . Wang, and M. D. Plumbley, “Audioldm 2: Learning holistic audio generation with self-supervised pretraining,”IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2024

work page 2024
[22]

DreamBooth: Fine tuning text-to-image diffusion models for subject- driven generation,

N. Ruiz, Y . Li, V . Jampani, Y . Pritch, M. Rubinstein, and K. Aberman, “DreamBooth: Fine tuning text-to-image diffusion models for subject- driven generation,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 22 500–22 510

work page 2023
[23]

Zero-shot text-to-image generation,

A. Ramesh, M. Pavlov, G. Goh, S. Gray, C. V oss, A. Radford, M. Chen, and I. Sutskever, “Zero-shot text-to-image generation,” inProceedings of International Conference on Machine Learning, 2021, pp. 8821–8831

work page 2021
[24]

Hierarchical Text-Conditional Image Generation with CLIP Latents

A. Ramesh, P. Dhariwal, A. Nichol, C. Chu, and M. Chen, “Hierarchical text-conditional image generation with CLIP latents,”arXiv:2204.06125, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[25]

Improving image generation with better captions,

J. Betker, G. Goh, L. Jing, T. Brooks, J. Wang, L. Li, L. Ouyang, J. Zhuang, J. Lee, Y . Guoet al., “Improving image generation with better captions,”Computer Science., vol. 2, no. 3, 2023

work page 2023
[26]

High- resolution image synthesis with latent diffusion models,

R. Rombach, A. Blattmann, D. Lorenz, P. Esser, and B. Ommer, “High- resolution image synthesis with latent diffusion models,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recogni- tion, 2022, pp. 10 684–10 695

work page 2022
[27]

Scaling rectified flow transformers for high-resolution image synthesis,

P. Esser, S. Kulal, A. Blattmann, R. Entezari, J. M ¨uller, H. Saini, Y . Levi, D. Lorenz, A. Sauer, F. Boeselet al., “Scaling rectified flow transformers for high-resolution image synthesis,” inProceedings of International Conference on Machine Learning, 2024

work page 2024
[28]

SoundCTM: Uniting score-based and consistency models for text-to-sound generation,

K. Saito, D. Kim, T. Shibuya, C.-H. Lai, Z. Zhong, Y . Takida, and Y . Mitsufuji, “SoundCTM: Uniting score-based and consistency models for text-to-sound generation,”arXiv:2405.18503, 2024

work page arXiv 2024
[29]

Audiobox TTA- RAG: Improving zero-shot and few-shot text-to-audio with retrieval- augmented generation,

M. Yang, B. Shi, M. Le, W.-N. Hsu, and A. Tjandra, “Audiobox TTA- RAG: Improving zero-shot and few-shot text-to-audio with retrieval- augmented generation,”arXiv:2411.05141, 2024

work page arXiv 2024
[30]

Adding conditional control to text-to-image diffusion models,

L. Zhang, A. Rao, and M. Agrawala, “Adding conditional control to text-to-image diffusion models,” inProceedings of IEEE International Conference on Computer Vision, 2023

work page 2023
[31]

AudioCaps: Generating captions for audios in the wild,

C. D. Kim, B. Kim, H. Lee, and G. Kim, “AudioCaps: Generating captions for audios in the wild,” inProceedings of Conference of the North American Chapter of the Association for Computational Linguistics, 2019, pp. 119–132

work page 2019
[32]

Score-based generative modeling through stochastic differen- tial equations,

Y . Song, J. Sohl-Dickstein, D. Kingma, A. Kumar, S. Ermon, and B. Poole, “Score-based generative modeling through stochastic differen- tial equations,” inProceedings of International Conference on Learning Representations, 2021

work page 2021
[33]

Diffusion models beat GANs on image synthesis,

P. Dhariwal and A. Nichol, “Diffusion models beat GANs on image synthesis,” inAdvances in Neural Information Processing Systems, 2021

work page 2021
[34]

Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding

C. Saharia, W. Chan, S. Saxena, L. Li, J. Whang, E. Denton, S. K. S. Ghasemipour, B. K. Ayan, S. S. Mahdavi, R. G. Lopes, T. Salimans, J. Ho, D. J. Fleet, and M. Norouzi, “Photorealistic text-to-image dif- fusion models with deep language understanding,”arXiv:2205.11487, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[35]

Image super-resolution via iterative refinement,

C. Saharia, J. Ho, W. Chan, T. Salimans, D. J. Fleet, and M. Norouzi, “Image super-resolution via iterative refinement,”IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 45, no. 4, pp. 4713– 4726, 2022

work page 2022
[36]

Wave- Grad: Estimating gradients for waveform generation,

N. Chen, Y . Zhang, H. Zen, R. Weiss, M. Norouzi, and W. Chan, “Wave- Grad: Estimating gradients for waveform generation,” inProceedings of International Conference on Learning Representations, 2021

work page 2021
[37]

DiffWave: A versatile diffusion model for audio synthesis,

Z. Kong, W. Ping, J. Huang, K. Zhao, and B. Catanzaro, “DiffWave: A versatile diffusion model for audio synthesis,” inInternational Confer- ence on Learning Representations, 2021

work page 2021
[38]

Make-A-Video: Text-to-video generation without text-video data,

U. Singer, A. Polyak, T. Hayes, X. Yin, J. An, S. Zhang, Q. Hu, H. Yang, O. Ashual, O. Gafniet al., “Make-A-Video: Text-to-video generation without text-video data,” inProceedings of International Conference on Learning Representations, 2022

work page 2022
[39]

Imagen Video: High Definition Video Generation with Diffusion Models

J. Ho, W. Chan, C. Saharia, J. Whang, R. Gao, A. Gritsenko, D. P. Kingma, B. Poole, M. Norouzi, D. J. Fleet, and T. Salimans, “Ima- gen Video: High definition video generation with diffusion models,” arXiv:2210.02303, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[40]

Grad- TTS: A diffusion probabilistic model for text-to-speech,

V . Popov, I. V ovk, V . Gogoryan, T. Sadekova, and M. Kudinov, “Grad- TTS: A diffusion probabilistic model for text-to-speech,” inProceedings of International Conference on Machine Learning, 2021, pp. 8599–8608

work page 2021
[41]

ResGrad: Residual denoising diffusion probabilistic models for text to speech,

Z. Chen, Y . Wu, Y . Leng, J. Chen, H. Liu, X. Tan, Y . Cui, K. Wang, L. He, S. Zhao, J. Bian, and D. Mandic, “ResGrad: Residual denoising diffusion probabilistic models for text to speech,”arXiv preprint:2212.14518, 2022

work page arXiv 2022
[42]

Bilateral denoising diffusion models,

M. Lam, J. Wang, R. Huang, D. Su, and D. Yu, “Bilateral denoising diffusion models,” inInternational Conference on Learning Represen- tations, 2022

work page 2022
[43]

Priorgrad: Improving conditional denoising diffu- sion models with data-driven adaptive prior,

S. Lee, H. Kim, C. Shin, X. Tan, C. Liu, Q. Meng, T. Qin, W. Chen, S. Yoon, and T. Liu, “Priorgrad: Improving conditional denoising diffu- sion models with data-driven adaptive prior,” inInternational Conference on Learning Representations, 2022

work page 2022
[44]

InferGrad: Improving diffusion models for vocoder by considering inference in training,

Z. Chen, X. Tan, K. Wang, S. Pan, D. Mandic, L. He, and S. Zhao, “InferGrad: Improving diffusion models for vocoder by considering inference in training,” inProceedings of IEEE International Conference on Acoustics, Speech and Signal Processing, 2022

work page 2022
[45]

Acoustic scene generation with conditional SampleRNN,

Q. Kong, Y . Xu, T. Iqbal, Y . Cao, W. Wang, and M. D. Plumbley, “Acoustic scene generation with conditional SampleRNN,” inProceed- ings of IEEE International Conference on Acoustics, Speech and Signal Processing, 2019, pp. 925–929

work page 2019
[46]

Conditional sound generation using neural discrete time-frequency representation learning,

X. Liu, T. Iqbal, J. Zhao, Q. Huang, M. Plumbley, and W. Wang, “Conditional sound generation using neural discrete time-frequency representation learning,”Proceedings of IEEE International Workshop on Machine Learning for Signal Processing, 2021

work page 2021
[47]

Leveraging pre-trained AudioLDM for sound generation: A benchmark study,

Y . Yuan, H. Liu, J. Liang, X. Liu, M. D. Plumbley, and W. Wang, “Leveraging pre-trained AudioLDM for sound generation: A benchmark study,” inProceedings of European Association for Signal Processing, 2023

work page 2023
[48]

HiFi-GAN: generative adversarial networks for efficient and high fidelity speech synthesis,

J. Kong, J. Kim, and J. Bae, “HiFi-GAN: generative adversarial networks for efficient and high fidelity speech synthesis,” inProceedings of the International Conference on Neural Information Processing Systems, 2020, pp. 17 022–17 033

work page 2020
[49]

MelGAN: Generative adversarial networks for conditional waveform synthesis,

K. Kumar, R. Kumar, T. De Boissiere, L. Gestin, W. Z. Teoh, J. Sotelo, A. De Brebisson, Y . Bengio, and A. C. Courville, “MelGAN: Generative adversarial networks for conditional waveform synthesis,”Advances in Neural Information Processing Systems, vol. 32, 2019

work page 2019
[50]

BigVGAN: A universal neural vocoder with large-scale training,

S.-g. Lee, W. Ping, B. Ginsburg, B. Catanzaro, and S. Yoon, “BigVGAN: A universal neural vocoder with large-scale training,” inProceedings of International Conference on Learning Representations, 2022

work page 2022
[51]

Large-scale contrastive language-audio pretraining with feature fusion and keyword-to-caption augmentation,

Y . Wu, K. Chen, T. Zhang, Y . Hui, T. Berg-Kirkpatrick, and S. Dubnov, “Large-scale contrastive language-audio pretraining with feature fusion and keyword-to-caption augmentation,” inProceedings of IEEE Inter- national Conference on Acoustics, Speech and Signal Processing, 2023. 12

work page 2023
[52]

Text-to-audio generation using instruction tuned LLM and latent diffusion model,

D. Ghosal, N. Majumder, A. Mehrish, and S. Poria, “Text-to-audio generation using instruction tuned LLM and latent diffusion model,” arXiv:2304.13731, 2023

work page arXiv 2023
[53]

Exploring the limits of transfer learning with a unified text-to-text transformer,

C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y . Zhou, W. Li, and P. J. Liu, “Exploring the limits of transfer learning with a unified text-to-text transformer,”Journal of Machine Learning Research, vol. 21, no. 1, pp. 5485–5551, 2020

work page 2020
[54]

Using pre-training can improve model robustness and uncertainty,

D. Hendrycks, K. Lee, and M. Mazeika, “Using pre-training can improve model robustness and uncertainty,” inProceedings of International Conference on Machine Learning, vol. 97, 2019, pp. 2712–2721

work page 2019
[55]

Text-driven Foley sound generation with latent diffusion model,

Y . Yuan, H. Liu, X. Kang, P. Wu, M. D. Plumbley, and W. Wang, “Text-driven Foley sound generation with latent diffusion model,” in Proceedings of the Detection and Classification of Acoustic Scenes and Events Workshop, 2023, pp. 231–235

work page 2023
[56]

Masactrl: Tuning-free mutual self-attention control for consistent image synthesis and editing,

M. Cao, X. Wang, Z. Qi, Y . Shan, X. Qie, and Y . Zheng, “Masactrl: Tuning-free mutual self-attention control for consistent image synthesis and editing,” inProceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 22 560–22 570

work page 2023
[57]

Anydoor: Zero-shot object-level image customization,

X. Chen, L. Huang, Y . Liu, Y . Shen, D. Zhao, and H. Zhao, “Anydoor: Zero-shot object-level image customization,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 6593–6602

work page 2024
[58]

Encoder-based domain tuning for fast personalization of text-to- image models,

R. Gal, M. Arar, Y . Atzmon, A. H. Bermano, G. Chechik, and D. Cohen- Or, “Encoder-based domain tuning for fast personalization of text-to- image models,”ACM Transactions on Graphics, vol. 42, no. 4, 2023

work page 2023
[59]

Imagic: Text-based real image editing with diffusion models,

B. Kawar, S. Zada, O. Lang, O. Tov, H. Chang, T. Dekel, I. Mosseri, and M. Irani, “Imagic: Text-based real image editing with diffusion models,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 6007–6017

work page 2023
[60]

Key-locked rank one editing for text-to-image personalization,

Y . Tewel, R. Gal, G. Chechik, and Y . Atzmon, “Key-locked rank one editing for text-to-image personalization,” inProceedings of ACM SIGGRAPH, 2023, pp. 1–11

work page 2023
[61]

An Image is Worth One Word: Personalizing Text-to-Image Generation using Textual Inversion

R. Gal, Y . Alaluf, Y . Atzmon, O. Patashnik, A. H. Bermano, G. Chechik, and D. Cohen-Or, “An image is worth one word: Personalizing text-to- image generation using textual inversion,”arXiv:2208.01618, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[62]

arXiv preprint arXiv:2303.09522 (2023)

A. V oynov, Q. Chu, D. Cohen-Or, and K. Aberman, “P+: Extended textual conditioning in text-to-image generation,”arXiv:2303.09522, 2023

work page arXiv 2023
[63]

A neural space- time representation for text-to-image personalization,

Y . Alaluf, E. Richardson, G. Metzer, and D. Cohen-Or, “A neural space- time representation for text-to-image personalization,”ACM Transac- tions on Graphics, vol. 42, no. 6, 2023

work page 2023
[64]

Improving expressivity of GNNs with subgraph- specific factor embedded normalization,

K. Chen, S. Liu, T. Zhu, J. Qiao, Y . Su, Y . Tian, T. Zheng, H. Zhang, Z. Feng, J. Yeet al., “Improving expressivity of GNNs with subgraph- specific factor embedded normalization,” inProceedings of the ACM SIGKDD Conference on Knowledge Discovery and Data Mining, 2023, pp. 237–249

work page 2023
[65]

BLIP-Diffusion: Pre-trained subject represen- tation for controllable text-to-image generation and editing,

D. Li, J. Li, and S. Hoi, “BLIP-Diffusion: Pre-trained subject represen- tation for controllable text-to-image generation and editing,”Advances in Neural Information Processing Systems, vol. 36, 2024

work page 2024
[66]

Cones: Concept neurons in diffusion models for customized generation,

Z. Liu, R. Feng, K. Zhu, Y . Zhang, K. Zheng, Y . Liu, D. Zhao, J. Zhou, and Y . Cao, “Cones: Concept neurons in diffusion models for customized generation,”arXiv:2303.05125, 2023

work page arXiv 2023
[67]

HyperDreamBooth: Hypernetworks for fast personalization of text-to-image models,

N. Ruiz, Y . Li, V . Jampani, W. Wei, T. Hou, Y . Pritch, N. Wadhwa, M. Rubinstein, and K. Aberman, “HyperDreamBooth: Hypernetworks for fast personalization of text-to-image models,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 6527–6536

work page 2024
[68]

Elite: Encoding visual concepts into textual embeddings for customized text-to-image generation,

Y . Wei, Y . Zhang, Z. Ji, J. Bai, L. Zhang, and W. Zuo, “Elite: Encoding visual concepts into textual embeddings for customized text-to-image generation,” inProceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 15 943–15 953

work page 2023
[69]

FreeCustom: Tuning-free customized image generation for multi- concept composition,

G. Ding, C. Zhao, W. Wang, Z. Yang, Z. Liu, H. Chen, and C. Shen, “FreeCustom: Tuning-free customized image generation for multi- concept composition,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 9089–9098

work page 2024
[70]

KNN-Diffusion: Image generation via large-scale retrieval,

S. Sheynin, O. Ashual, A. Polyak, U. Singer, O. Gafni, E. Nachmani, and Y . Taigman, “KNN-Diffusion: Image generation via large-scale retrieval,” inProceedings of International Conference on Learning Representations, 2023

work page 2023
[71]

Re-Imagen: Retrieval- augmented text-to-image generator,

W. Chen, H. Hu, C. Saharia, and W. W. Cohen, “Re-Imagen: Retrieval- augmented text-to-image generator,” inProceedings of International Conference on Learning Representations, 2023

work page 2023
[72]

Learning transferable visual models from natural language supervision,

A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clarket al., “Learning transferable visual models from natural language supervision,” inInternational Conference on Machine Learning, 2021, pp. 8748–8763

work page 2021
[73]

T-CLAP: Temporal-enhanced contrastive language- audio pretraining,

Y . Yuan, Z. Chen, X. Liu, H. Liu, X. Xu, D. Jia, Y . Chen, M. D. Plumb- ley, and W. Wang, “T-CLAP: Temporal-enhanced contrastive language- audio pretraining,” inProceedings of IEEE International Workshop on Machine Learning for Signal Processing, 2024

work page 2024
[74]

Flowsep: Language-queried sound separation with rectified flow matching,

Y . Yuan, X. Liu, H. Liu, M. D. Plumbley, and W. Wang, “Flowsep: Language-queried sound separation with rectified flow matching,”arXiv preprint arXiv:2409.07614, 2024

work page arXiv 2024
[75]

Audiobox: Unified audio generation with natural language prompts.arXiv preprint arXiv:2312.15821, 2023

A. Vyas, B. Shi, M. Le, A. Tjandra, Y .-C. Wu, B. Guo, J. Zhang, X. Zhang, R. Adkins, W. Nganet al., “Audiobox: Unified audio generation with natural language prompts,”arXiv:2312.15821, 2023

work page arXiv 2023
[76]

Freesound technical demo,

F. Font, G. Roma, and X. Serra, “Freesound technical demo,” in Proceedings of the ACM International Conference on Multimedia, 2013, pp. 411–412

work page 2013
[77]

Deep convolutional neural networks and data augmentation for environmental sound classification,

J. Salamon and J. P. Bello, “Deep convolutional neural networks and data augmentation for environmental sound classification,”IEEE Signal Processing Letters, vol. 24, no. 3, pp. 279–283, 2017

work page 2017
[78]

ESC: dataset for environmental sound classification,

K. J. Piczak, “ESC: dataset for environmental sound classification,” in Proceedings of the ACM International Conference on Multimedia, 2015

work page 2015
[79]

Very Deep Convolutional Networks for Large-Scale Image Recognition

K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image Recognition,”arXiv:1409.1556, 2014

work page internal anchor Pith review Pith/arXiv arXiv 2014
[80]

PANNs: Large-scale pretrained audio neural networks for audio pattern recognition,

Q. Kong, Y . Cao, T. Iqbal, Y . Wang, W. Wang, and M. D. Plumbley, “PANNs: Large-scale pretrained audio neural networks for audio pattern recognition,”IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 28, pp. 2880–2894, 2020

work page 2020
[81]

Meta Audiobox Aesthetics: Unified Automatic Quality Assessment for Speech, Music, and Sound

A. Tjandra, Y .-C. Wu, B. Guo, J. Hoffman, B. Ellis, A. Vyas, B. Shi, S. Chen, M. Le, N. Zacharovet al., “Meta Audiobox Aesthetics: Unified automatic quality assessment for speech, music, and sound,” arXiv:2502.05139, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

Showing first 80 references.

[1] [1]

A comprehensive survey of AI-generated content: A history of generative AI from GAN to ChatGPT,

Y . Cao, S. Li, Y . Liu, Z. Yan, Y . Dai, P. S. Yu, and L. Sun, “A comprehensive survey of AI-generated content: A history of generative AI from GAN to ChatGPT,”arXiv:2303.04226, 2023

work page arXiv 2023

[2] [2]

AudioLDM: Text-to-Audio generation with latent diffusion models,

H. Liu, Z. Chen, Y . Yuan, X. Mei, X. Liu, D. Mandic, W. Wang, and M. D. Plumbley, “AudioLDM: Text-to-Audio generation with latent diffusion models,” inProceedings of the International Conference on Machine Learning, 2023, pp. 21 450–21 474

work page 2023

[3] [3]

Sound to visual scene generation by audio-to-visual latent alignment,

K. Sung-Bin, A. Senocak, H. Ha, A. Owens, and T.-H. Oh, “Sound to visual scene generation by audio-to-visual latent alignment,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 6430–6440

work page 2023

[4] [4]

I hear your true colors: Image guided audio gen- eration,

R. Sheffer and Y . Adi, “I hear your true colors: Image guided audio gen- eration,” inProceedings of IEEE International Conference on Acoustics, Speech and Signal Processing, 2023

work page 2023

[5] [5]

FoleyGen: Visually-guided audio generation,

X. Mei, V . Nagaraja, G. Le Lan, Z. Ni, E. Chang, Y . Shi, and V . Chandra, “FoleyGen: Visually-guided audio generation,” inProceedings of IEEE International Workshop on Machine Learning for Signal Processing, 2024

work page 2024

[6] [6]

Taming visually guided sound generation,

V . Iashin and E. Rahtu, “Taming visually guided sound generation,” in Proceedings of British Machine Vision Conference, 2021

work page 2021

[7] [7]

AudioGen: Textually guided audio generation,

F. Kreuk, G. Synnaeve, A. Polyak, U. Singer, A. D ´efossez, J. Copet, D. Parikh, Y . Taigman, and Y . Adi, “AudioGen: Textually guided audio generation,”International Conference on Learning Representations, 2022

work page 2022

[8] [8]

Riffusion: Stable diffusion for real-time music generation,

S. Forsgren and H. Martiros, “Riffusion: Stable diffusion for real-time music generation,” 2022.[Online]. Available: https://riffusion.com/about

work page 2022

[9] [9]

WavJourney: Compositional audio creation with large language models,

X. Liu, Z. Zhu, H. Liu, Y . Yuan, M. Cui, Q. Huang, J. Liang, Y . Cao, Q. Kong, M. D. Plumbleyet al., “WavJourney: Compositional audio creation with large language models,”arXiv:2307.14335, 2023. 11

work page arXiv 2023

[10] [10]

Denoising diffusion implicit models,

J. Song, C. Meng, and S. Ermon, “Denoising diffusion implicit models,” inProceedings of International Conference on Learning Representa- tions, 2020

work page 2020

[11] [11]

Denoising diffusion probabilistic models,

J. Ho, A. Jain, and P. Abbeel, “Denoising diffusion probabilistic models,” inAdvances in Neural Information Processing Systems, vol. 33, 2020, pp. 6840–6851

work page 2020

[12] [12]

AudioSet: An ontology and human-labeled dataset for audio events,

J. F. Gemmeke, D. P. W. Ellis, D. Freedman, A. Jansen, W. Lawrence, R. C. Moore, M. Plakal, and M. Ritter, “AudioSet: An ontology and human-labeled dataset for audio events,” inProceedings of IEEE International Conference on Acoustics, Speech and Signal Processing, 2017, pp. 776–780

work page 2017

[13] [13]

WavCaps: A ChatGPT-assisted weakly-labelled audio captioning dataset for audio-language multimodal research,

X. Mei, C. Meng, H. Liu, Q. Kong, T. Ko, C. Zhao, M. D. Plumbley, Y . Zou, and W. Wang, “WavCaps: A ChatGPT-assisted weakly-labelled audio captioning dataset for audio-language multimodal research,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 32, pp. 3339–3354, 2024

work page 2024

[14] [14]

A large-scale dataset for audio- language representation learning,

L. Sun, X. Xu, M. Wu, and W. Xie, “A large-scale dataset for audio- language representation learning,”arXiv:2309.11500, 2023

work page arXiv 2023

[15] [15]

Sound-VECaps: Improving audio generation with visual enhanced captions,

Y . Yuan, D. Jia, X. Zhuang, Y . Chen, Z. Liu, Z. Chen, Y . Wang, Y . Wang, X. Liu, X. Kanget al., “Sound-VECaps: Improving audio generation with visual enhanced captions,” inAudio Imagination: NeurIPS 2024 Workshop AI-Driven Speech, Music, and Sound Generation

work page 2024

[16] [16]

AudioGen: textually guided audio generation,

F. Kreuk, G. Synnaeve, A. Polyak, U. Singer, A. D ´efossez, J. Copet, D. Parikh, Y . Taigman, and Y . Adi, “AudioGen: textually guided audio generation,” inProceedings of International Conference on Learning Representations, 2023

work page 2023

[17] [17]

Diffsound: Discrete diffusion model for text-to-sound generation,

D. Yang, J. Yu, H. Wang, W. Wang, C. Weng, Y . Zou, and D. Yu, “Diffsound: Discrete diffusion model for text-to-sound generation,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2023

work page 2023

[18] [19]

Make-An-Audio 2: Temporal-enhanced text-to- audio generation,

J. Huang, Y . Ren, R. Huang, D. Yang, Z. Ye, C. Zhang, J. Liu, X. Yin, Z. Ma, and Z. Zhao, “Make-An-Audio 2: Temporal-enhanced text-to- audio generation,”arXiv:2305.18474, 2023

work page arXiv 2023

[19] [20]

Retrieval-augmented text-to-audio generation,

Y . Yuan, H. Liu, X. Liu, Q. Huang, M. D. Plumbley, and W. Wang, “Retrieval-augmented text-to-audio generation,” inProceedings of IEEE International Conference on Acoustics, Speech and Signal, 2024, pp. 581–585

work page 2024

[20] [21]

Audioldm 2: Learning holistic audio generation with self-supervised pretraining,

H. Liu, Y . Yuan, X. Liu, X. Mei, Q. Kong, Q. Tian, Y . Wang, W. Wang, Y . Wang, and M. D. Plumbley, “Audioldm 2: Learning holistic audio generation with self-supervised pretraining,”IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2024

work page 2024

[21] [22]

DreamBooth: Fine tuning text-to-image diffusion models for subject- driven generation,

N. Ruiz, Y . Li, V . Jampani, Y . Pritch, M. Rubinstein, and K. Aberman, “DreamBooth: Fine tuning text-to-image diffusion models for subject- driven generation,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 22 500–22 510

work page 2023

[22] [23]

Zero-shot text-to-image generation,

A. Ramesh, M. Pavlov, G. Goh, S. Gray, C. V oss, A. Radford, M. Chen, and I. Sutskever, “Zero-shot text-to-image generation,” inProceedings of International Conference on Machine Learning, 2021, pp. 8821–8831

work page 2021

[23] [24]

Hierarchical Text-Conditional Image Generation with CLIP Latents

A. Ramesh, P. Dhariwal, A. Nichol, C. Chu, and M. Chen, “Hierarchical text-conditional image generation with CLIP latents,”arXiv:2204.06125, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022

[24] [25]

Improving image generation with better captions,

J. Betker, G. Goh, L. Jing, T. Brooks, J. Wang, L. Li, L. Ouyang, J. Zhuang, J. Lee, Y . Guoet al., “Improving image generation with better captions,”Computer Science., vol. 2, no. 3, 2023

work page 2023

[25] [26]

High- resolution image synthesis with latent diffusion models,

R. Rombach, A. Blattmann, D. Lorenz, P. Esser, and B. Ommer, “High- resolution image synthesis with latent diffusion models,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recogni- tion, 2022, pp. 10 684–10 695

work page 2022

[26] [27]

Scaling rectified flow transformers for high-resolution image synthesis,

P. Esser, S. Kulal, A. Blattmann, R. Entezari, J. M ¨uller, H. Saini, Y . Levi, D. Lorenz, A. Sauer, F. Boeselet al., “Scaling rectified flow transformers for high-resolution image synthesis,” inProceedings of International Conference on Machine Learning, 2024

work page 2024

[27] [28]

SoundCTM: Uniting score-based and consistency models for text-to-sound generation,

K. Saito, D. Kim, T. Shibuya, C.-H. Lai, Z. Zhong, Y . Takida, and Y . Mitsufuji, “SoundCTM: Uniting score-based and consistency models for text-to-sound generation,”arXiv:2405.18503, 2024

work page arXiv 2024

[28] [29]

Audiobox TTA- RAG: Improving zero-shot and few-shot text-to-audio with retrieval- augmented generation,

M. Yang, B. Shi, M. Le, W.-N. Hsu, and A. Tjandra, “Audiobox TTA- RAG: Improving zero-shot and few-shot text-to-audio with retrieval- augmented generation,”arXiv:2411.05141, 2024

work page arXiv 2024

[29] [30]

Adding conditional control to text-to-image diffusion models,

L. Zhang, A. Rao, and M. Agrawala, “Adding conditional control to text-to-image diffusion models,” inProceedings of IEEE International Conference on Computer Vision, 2023

work page 2023

[30] [31]

AudioCaps: Generating captions for audios in the wild,

C. D. Kim, B. Kim, H. Lee, and G. Kim, “AudioCaps: Generating captions for audios in the wild,” inProceedings of Conference of the North American Chapter of the Association for Computational Linguistics, 2019, pp. 119–132

work page 2019

[31] [32]

Score-based generative modeling through stochastic differen- tial equations,

Y . Song, J. Sohl-Dickstein, D. Kingma, A. Kumar, S. Ermon, and B. Poole, “Score-based generative modeling through stochastic differen- tial equations,” inProceedings of International Conference on Learning Representations, 2021

work page 2021

[32] [33]

Diffusion models beat GANs on image synthesis,

P. Dhariwal and A. Nichol, “Diffusion models beat GANs on image synthesis,” inAdvances in Neural Information Processing Systems, 2021

work page 2021

[33] [34]

Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding

C. Saharia, W. Chan, S. Saxena, L. Li, J. Whang, E. Denton, S. K. S. Ghasemipour, B. K. Ayan, S. S. Mahdavi, R. G. Lopes, T. Salimans, J. Ho, D. J. Fleet, and M. Norouzi, “Photorealistic text-to-image dif- fusion models with deep language understanding,”arXiv:2205.11487, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022

[34] [35]

Image super-resolution via iterative refinement,

C. Saharia, J. Ho, W. Chan, T. Salimans, D. J. Fleet, and M. Norouzi, “Image super-resolution via iterative refinement,”IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 45, no. 4, pp. 4713– 4726, 2022

work page 2022

[35] [36]

Wave- Grad: Estimating gradients for waveform generation,

N. Chen, Y . Zhang, H. Zen, R. Weiss, M. Norouzi, and W. Chan, “Wave- Grad: Estimating gradients for waveform generation,” inProceedings of International Conference on Learning Representations, 2021

work page 2021

[36] [37]

DiffWave: A versatile diffusion model for audio synthesis,

Z. Kong, W. Ping, J. Huang, K. Zhao, and B. Catanzaro, “DiffWave: A versatile diffusion model for audio synthesis,” inInternational Confer- ence on Learning Representations, 2021

work page 2021

[37] [38]

Make-A-Video: Text-to-video generation without text-video data,

U. Singer, A. Polyak, T. Hayes, X. Yin, J. An, S. Zhang, Q. Hu, H. Yang, O. Ashual, O. Gafniet al., “Make-A-Video: Text-to-video generation without text-video data,” inProceedings of International Conference on Learning Representations, 2022

work page 2022

[38] [39]

Imagen Video: High Definition Video Generation with Diffusion Models

J. Ho, W. Chan, C. Saharia, J. Whang, R. Gao, A. Gritsenko, D. P. Kingma, B. Poole, M. Norouzi, D. J. Fleet, and T. Salimans, “Ima- gen Video: High definition video generation with diffusion models,” arXiv:2210.02303, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022

[39] [40]

Grad- TTS: A diffusion probabilistic model for text-to-speech,

V . Popov, I. V ovk, V . Gogoryan, T. Sadekova, and M. Kudinov, “Grad- TTS: A diffusion probabilistic model for text-to-speech,” inProceedings of International Conference on Machine Learning, 2021, pp. 8599–8608

work page 2021

[40] [41]

ResGrad: Residual denoising diffusion probabilistic models for text to speech,

Z. Chen, Y . Wu, Y . Leng, J. Chen, H. Liu, X. Tan, Y . Cui, K. Wang, L. He, S. Zhao, J. Bian, and D. Mandic, “ResGrad: Residual denoising diffusion probabilistic models for text to speech,”arXiv preprint:2212.14518, 2022

work page arXiv 2022

[41] [42]

Bilateral denoising diffusion models,

M. Lam, J. Wang, R. Huang, D. Su, and D. Yu, “Bilateral denoising diffusion models,” inInternational Conference on Learning Represen- tations, 2022

work page 2022

[42] [43]

Priorgrad: Improving conditional denoising diffu- sion models with data-driven adaptive prior,

S. Lee, H. Kim, C. Shin, X. Tan, C. Liu, Q. Meng, T. Qin, W. Chen, S. Yoon, and T. Liu, “Priorgrad: Improving conditional denoising diffu- sion models with data-driven adaptive prior,” inInternational Conference on Learning Representations, 2022

work page 2022

[43] [44]

InferGrad: Improving diffusion models for vocoder by considering inference in training,

Z. Chen, X. Tan, K. Wang, S. Pan, D. Mandic, L. He, and S. Zhao, “InferGrad: Improving diffusion models for vocoder by considering inference in training,” inProceedings of IEEE International Conference on Acoustics, Speech and Signal Processing, 2022

work page 2022

[44] [45]

Acoustic scene generation with conditional SampleRNN,

Q. Kong, Y . Xu, T. Iqbal, Y . Cao, W. Wang, and M. D. Plumbley, “Acoustic scene generation with conditional SampleRNN,” inProceed- ings of IEEE International Conference on Acoustics, Speech and Signal Processing, 2019, pp. 925–929

work page 2019

[45] [46]

Conditional sound generation using neural discrete time-frequency representation learning,

X. Liu, T. Iqbal, J. Zhao, Q. Huang, M. Plumbley, and W. Wang, “Conditional sound generation using neural discrete time-frequency representation learning,”Proceedings of IEEE International Workshop on Machine Learning for Signal Processing, 2021

work page 2021

[46] [47]

Leveraging pre-trained AudioLDM for sound generation: A benchmark study,

Y . Yuan, H. Liu, J. Liang, X. Liu, M. D. Plumbley, and W. Wang, “Leveraging pre-trained AudioLDM for sound generation: A benchmark study,” inProceedings of European Association for Signal Processing, 2023

work page 2023

[47] [48]

HiFi-GAN: generative adversarial networks for efficient and high fidelity speech synthesis,

J. Kong, J. Kim, and J. Bae, “HiFi-GAN: generative adversarial networks for efficient and high fidelity speech synthesis,” inProceedings of the International Conference on Neural Information Processing Systems, 2020, pp. 17 022–17 033

work page 2020

[48] [49]

MelGAN: Generative adversarial networks for conditional waveform synthesis,

K. Kumar, R. Kumar, T. De Boissiere, L. Gestin, W. Z. Teoh, J. Sotelo, A. De Brebisson, Y . Bengio, and A. C. Courville, “MelGAN: Generative adversarial networks for conditional waveform synthesis,”Advances in Neural Information Processing Systems, vol. 32, 2019

work page 2019

[49] [50]

BigVGAN: A universal neural vocoder with large-scale training,

S.-g. Lee, W. Ping, B. Ginsburg, B. Catanzaro, and S. Yoon, “BigVGAN: A universal neural vocoder with large-scale training,” inProceedings of International Conference on Learning Representations, 2022

work page 2022

[50] [51]

Large-scale contrastive language-audio pretraining with feature fusion and keyword-to-caption augmentation,

Y . Wu, K. Chen, T. Zhang, Y . Hui, T. Berg-Kirkpatrick, and S. Dubnov, “Large-scale contrastive language-audio pretraining with feature fusion and keyword-to-caption augmentation,” inProceedings of IEEE Inter- national Conference on Acoustics, Speech and Signal Processing, 2023. 12

work page 2023

[51] [52]

Text-to-audio generation using instruction tuned LLM and latent diffusion model,

D. Ghosal, N. Majumder, A. Mehrish, and S. Poria, “Text-to-audio generation using instruction tuned LLM and latent diffusion model,” arXiv:2304.13731, 2023

work page arXiv 2023

[52] [53]

Exploring the limits of transfer learning with a unified text-to-text transformer,

C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y . Zhou, W. Li, and P. J. Liu, “Exploring the limits of transfer learning with a unified text-to-text transformer,”Journal of Machine Learning Research, vol. 21, no. 1, pp. 5485–5551, 2020

work page 2020

[53] [54]

Using pre-training can improve model robustness and uncertainty,

D. Hendrycks, K. Lee, and M. Mazeika, “Using pre-training can improve model robustness and uncertainty,” inProceedings of International Conference on Machine Learning, vol. 97, 2019, pp. 2712–2721

work page 2019

[54] [55]

Text-driven Foley sound generation with latent diffusion model,

Y . Yuan, H. Liu, X. Kang, P. Wu, M. D. Plumbley, and W. Wang, “Text-driven Foley sound generation with latent diffusion model,” in Proceedings of the Detection and Classification of Acoustic Scenes and Events Workshop, 2023, pp. 231–235

work page 2023

[55] [56]

Masactrl: Tuning-free mutual self-attention control for consistent image synthesis and editing,

M. Cao, X. Wang, Z. Qi, Y . Shan, X. Qie, and Y . Zheng, “Masactrl: Tuning-free mutual self-attention control for consistent image synthesis and editing,” inProceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 22 560–22 570

work page 2023

[56] [57]

Anydoor: Zero-shot object-level image customization,

X. Chen, L. Huang, Y . Liu, Y . Shen, D. Zhao, and H. Zhao, “Anydoor: Zero-shot object-level image customization,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 6593–6602

work page 2024

[57] [58]

Encoder-based domain tuning for fast personalization of text-to- image models,

R. Gal, M. Arar, Y . Atzmon, A. H. Bermano, G. Chechik, and D. Cohen- Or, “Encoder-based domain tuning for fast personalization of text-to- image models,”ACM Transactions on Graphics, vol. 42, no. 4, 2023

work page 2023

[58] [59]

Imagic: Text-based real image editing with diffusion models,

B. Kawar, S. Zada, O. Lang, O. Tov, H. Chang, T. Dekel, I. Mosseri, and M. Irani, “Imagic: Text-based real image editing with diffusion models,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 6007–6017

work page 2023

[59] [60]

Key-locked rank one editing for text-to-image personalization,

Y . Tewel, R. Gal, G. Chechik, and Y . Atzmon, “Key-locked rank one editing for text-to-image personalization,” inProceedings of ACM SIGGRAPH, 2023, pp. 1–11

work page 2023

[60] [61]

An Image is Worth One Word: Personalizing Text-to-Image Generation using Textual Inversion

R. Gal, Y . Alaluf, Y . Atzmon, O. Patashnik, A. H. Bermano, G. Chechik, and D. Cohen-Or, “An image is worth one word: Personalizing text-to- image generation using textual inversion,”arXiv:2208.01618, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022

[61] [62]

arXiv preprint arXiv:2303.09522 (2023)

A. V oynov, Q. Chu, D. Cohen-Or, and K. Aberman, “P+: Extended textual conditioning in text-to-image generation,”arXiv:2303.09522, 2023

work page arXiv 2023

[62] [63]

A neural space- time representation for text-to-image personalization,

Y . Alaluf, E. Richardson, G. Metzer, and D. Cohen-Or, “A neural space- time representation for text-to-image personalization,”ACM Transac- tions on Graphics, vol. 42, no. 6, 2023

work page 2023

[63] [64]

Improving expressivity of GNNs with subgraph- specific factor embedded normalization,

K. Chen, S. Liu, T. Zhu, J. Qiao, Y . Su, Y . Tian, T. Zheng, H. Zhang, Z. Feng, J. Yeet al., “Improving expressivity of GNNs with subgraph- specific factor embedded normalization,” inProceedings of the ACM SIGKDD Conference on Knowledge Discovery and Data Mining, 2023, pp. 237–249

work page 2023

[64] [65]

BLIP-Diffusion: Pre-trained subject represen- tation for controllable text-to-image generation and editing,

D. Li, J. Li, and S. Hoi, “BLIP-Diffusion: Pre-trained subject represen- tation for controllable text-to-image generation and editing,”Advances in Neural Information Processing Systems, vol. 36, 2024

work page 2024

[65] [66]

Cones: Concept neurons in diffusion models for customized generation,

Z. Liu, R. Feng, K. Zhu, Y . Zhang, K. Zheng, Y . Liu, D. Zhao, J. Zhou, and Y . Cao, “Cones: Concept neurons in diffusion models for customized generation,”arXiv:2303.05125, 2023

work page arXiv 2023

[66] [67]

HyperDreamBooth: Hypernetworks for fast personalization of text-to-image models,

N. Ruiz, Y . Li, V . Jampani, W. Wei, T. Hou, Y . Pritch, N. Wadhwa, M. Rubinstein, and K. Aberman, “HyperDreamBooth: Hypernetworks for fast personalization of text-to-image models,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 6527–6536

work page 2024

[67] [68]

Elite: Encoding visual concepts into textual embeddings for customized text-to-image generation,

Y . Wei, Y . Zhang, Z. Ji, J. Bai, L. Zhang, and W. Zuo, “Elite: Encoding visual concepts into textual embeddings for customized text-to-image generation,” inProceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 15 943–15 953

work page 2023

[68] [69]

FreeCustom: Tuning-free customized image generation for multi- concept composition,

G. Ding, C. Zhao, W. Wang, Z. Yang, Z. Liu, H. Chen, and C. Shen, “FreeCustom: Tuning-free customized image generation for multi- concept composition,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 9089–9098

work page 2024

[69] [70]

KNN-Diffusion: Image generation via large-scale retrieval,

S. Sheynin, O. Ashual, A. Polyak, U. Singer, O. Gafni, E. Nachmani, and Y . Taigman, “KNN-Diffusion: Image generation via large-scale retrieval,” inProceedings of International Conference on Learning Representations, 2023

work page 2023

[70] [71]

Re-Imagen: Retrieval- augmented text-to-image generator,

W. Chen, H. Hu, C. Saharia, and W. W. Cohen, “Re-Imagen: Retrieval- augmented text-to-image generator,” inProceedings of International Conference on Learning Representations, 2023

work page 2023

[71] [72]

Learning transferable visual models from natural language supervision,

A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clarket al., “Learning transferable visual models from natural language supervision,” inInternational Conference on Machine Learning, 2021, pp. 8748–8763

work page 2021

[72] [73]

T-CLAP: Temporal-enhanced contrastive language- audio pretraining,

Y . Yuan, Z. Chen, X. Liu, H. Liu, X. Xu, D. Jia, Y . Chen, M. D. Plumb- ley, and W. Wang, “T-CLAP: Temporal-enhanced contrastive language- audio pretraining,” inProceedings of IEEE International Workshop on Machine Learning for Signal Processing, 2024

work page 2024

[73] [74]

Flowsep: Language-queried sound separation with rectified flow matching,

Y . Yuan, X. Liu, H. Liu, M. D. Plumbley, and W. Wang, “Flowsep: Language-queried sound separation with rectified flow matching,”arXiv preprint arXiv:2409.07614, 2024

work page arXiv 2024

[74] [75]

Audiobox: Unified audio generation with natural language prompts.arXiv preprint arXiv:2312.15821, 2023

A. Vyas, B. Shi, M. Le, A. Tjandra, Y .-C. Wu, B. Guo, J. Zhang, X. Zhang, R. Adkins, W. Nganet al., “Audiobox: Unified audio generation with natural language prompts,”arXiv:2312.15821, 2023

work page arXiv 2023

[75] [76]

Freesound technical demo,

F. Font, G. Roma, and X. Serra, “Freesound technical demo,” in Proceedings of the ACM International Conference on Multimedia, 2013, pp. 411–412

work page 2013

[76] [77]

Deep convolutional neural networks and data augmentation for environmental sound classification,

J. Salamon and J. P. Bello, “Deep convolutional neural networks and data augmentation for environmental sound classification,”IEEE Signal Processing Letters, vol. 24, no. 3, pp. 279–283, 2017

work page 2017

[77] [78]

ESC: dataset for environmental sound classification,

K. J. Piczak, “ESC: dataset for environmental sound classification,” in Proceedings of the ACM International Conference on Multimedia, 2015

work page 2015

[78] [79]

Very Deep Convolutional Networks for Large-Scale Image Recognition

K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image Recognition,”arXiv:1409.1556, 2014

work page internal anchor Pith review Pith/arXiv arXiv 2014

[79] [80]

PANNs: Large-scale pretrained audio neural networks for audio pattern recognition,

Q. Kong, Y . Cao, T. Iqbal, Y . Wang, W. Wang, and M. D. Plumbley, “PANNs: Large-scale pretrained audio neural networks for audio pattern recognition,”IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 28, pp. 2880–2894, 2020

work page 2020

[80] [81]

Meta Audiobox Aesthetics: Unified Automatic Quality Assessment for Speech, Music, and Sound

A. Tjandra, Y .-C. Wu, B. Guo, J. Hoffman, B. Ellis, A. Vyas, B. Shi, S. Chen, M. Le, N. Zacharovet al., “Meta Audiobox Aesthetics: Unified automatic quality assessment for speech, music, and sound,” arXiv:2502.05139, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025