AudioGen: Textually Guided Audio Generation

Adam Polyak; Alexandre Défossez; Devi Parikh; Felix Kreuk; Gabriel Synnaeve; Jade Copet; Uriel Singer; Yaniv Taigman; Yossi Adi

arxiv: 2209.15352 · v2 · pith:CGLOUNM4new · submitted 2022-09-30 · 💻 cs.SD · cs.CL · cs.LG · eess.AS


Felix Kreuk, Gabriel Synnaeve, Adam Polyak, Uriel Singer, Alexandre Défossez, Jade Copet, Devi Parikh, Yaniv Taigman, Yossi Adi


classification 💻 cs.SD · cs.CL · cs.LG · eess.AS

keywords audio · text · audiogen · samples · multiple · ability · annotations · challenges



We tackle the problem of generating audio samples conditioned on descriptive text captions. In this work, we propose AudioGen, an auto-regressive generative model that generates audio samples conditioned on text inputs. AudioGen operates on a learnt discrete audio representation. The task of text-to-audio generation poses multiple challenges. Due to the way audio travels through a medium, differentiating "objects" can be a difficult task (e.g., separating multiple people simultaneously speaking). This is further complicated by real-world recording conditions (e.g., background noise, reverberation, etc.). Scarce text annotations impose another constraint, limiting the ability to scale models. Finally, modeling high-fidelity audio requires encoding audio at high sampling rate, leading to extremely long sequences. To alleviate the aforementioned challenges we propose an augmentation technique that mixes different audio samples, driving the model to internally learn to separate multiple sources. We curated 10 datasets containing different types of audio and text annotations to handle the scarcity of text-audio data points. For faster inference, we explore the use of multi-stream modeling, allowing the use of shorter sequences while maintaining a similar bitrate and perceptual quality. We apply classifier-free guidance to improve adherence to text. Compared to the evaluated baselines, AudioGen outperforms them on both objective and subjective metrics. Finally, we explore the ability of the proposed method to generate audio continuation conditionally and unconditionally. Samples: https://felixkreuk.github.io/audiogen
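
The abstract names two techniques without giving details: an augmentation that mixes audio samples, and classifier-free guidance. The following is a minimal sketch of how each might look, not the authors' implementation. All function names are hypothetical, and the abstract does not specify the SNR range of [-5, 5] dB, the joining of the two captions, or the guidance formula applied to next-token logits; these are assumptions made for illustration.

```python
import math
import random


def mix_at_snr(a, b, snr_db):
    """Add waveform b to waveform a, scaling b so that a is snr_db dB louder."""
    n = min(len(a), len(b))
    if n == 0:
        return []
    a, b = a[:n], b[:n]
    power_a = sum(x * x for x in a) / n
    power_b = sum(x * x for x in b) / n
    if power_b == 0.0:
        return list(a)  # b is silent, nothing to mix in
    # Choose gain so that 10 * log10(power_a / (gain^2 * power_b)) == snr_db.
    gain = math.sqrt(power_a / (power_b * 10 ** (snr_db / 10)))
    return [x + gain * y for x, y in zip(a, b)]


def mix_example(audio_a, text_a, audio_b, text_b, rng):
    """Build one augmented (audio, caption) pair from two annotated clips."""
    snr_db = rng.uniform(-5.0, 5.0)  # assumed range, not given in the abstract
    mixed = mix_at_snr(audio_a, audio_b, snr_db)
    return mixed, f"{text_a} {text_b}"  # assumed: captions are simply joined


def guided_logits(cond, uncond, scale):
    """Classifier-free guidance on logits: uncond + scale * (cond - uncond)."""
    return [u + scale * (c - u) for c, u in zip(cond, uncond)]
```

With `scale` equal to 1 the guided logits reduce to the text-conditioned ones; larger values push sampling further toward the text condition at the cost of diversity.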

This paper has not been read by Pith yet.


Forward citations

Cited by 27 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

DTM-Codec: Dynamic Token Masking for VFR Speech Coding with Efficient Boundary Selection
eess.AS 2026-06 unverdicted novelty 7.0

DTM-Codec achieves better reconstruction quality and intelligibility than fixed-frame-rate neural speech codecs at matched total bitrate via dynamic token masking and Path Length Equalization for variable frame rates.
AudioCALM: Continuous Autoregressive Language Modeling for Universal Audio Generation
eess.AS 2026-06 unverdicted novelty 7.0

AudioCALM presents a continuous autoregressive framework with flow-matching prediction and A-MoME architecture that unifies speech, sound, and music generation while matching modality-specific state-of-the-art performance.
HoliDubber: Holistic Video Dubbing for Complex Acoustic Scenes via Text-Guided Audio Synthesis
eess.AS 2026-06 unverdicted novelty 7.0

HoliDubber introduces a patch-based autoregressive diffusion transformer for joint text-guided synthesis of speech and ambient audio in video dubbing, with a new benchmark showing outperformance over prior speech-only...
Exploring LLMs for South Asian Music Understanding and Generation
cs.SD 2026-06 unverdicted novelty 7.0

This paper introduces a 504-question benchmark for South Asian music understanding and a controlled prompting framework for generation, reporting frontier LLMs at 85-90% on understanding but only 40% stylistic faithfu...
HapticLDM: A Diffusion Model for Text-to-Vibrotactile Generation
cs.HC 2026-05 unverdicted novelty 7.0

HapticLDM is the first latent diffusion model that generates vibrotactile signals directly from text, using dynamic text curation and global denoising to improve realism and semantic alignment over autoregressive baselines.
PairAlign: A Framework for Sequence Tokenization via Self-Alignment with Applications to Audio Tokenization
cs.LG 2026-05 unverdicted novelty 7.0

PairAlign learns compact audio token sequences via self-alignment of paired content views using an autoregressive decoder, achieving strong cross-view consistency and edit-distance preservation while reducing token co...
PairAlign: A Framework for Sequence Tokenization via Self-Alignment with Applications to Audio Tokenization
cs.LG 2026-05 unverdicted novelty 7.0

PairAlign learns compact variable-length token sequences for audio via self-alignment on paired content-preserving views, achieving 55% fewer archive tokens than VQ while preserving edit-distance retrieval at 12.71 tokens/s.
FoleyDesigner: Immersive Stereo Foley Generation with Precise Spatio-Temporal Alignment for Film Clips
cs.CV 2026-04 unverdicted novelty 7.0

FoleyDesigner generates spatio-temporally aligned stereo Foley audio for film clips via multi-agent analysis, diffusion models on video cues, and LLM mixing, supported by the new FilmStereo dataset.
Omni2Sound: Towards Unified Video-Text-to-Audio Generation
cs.SD 2026-01 unverdicted novelty 7.0

A single DiT-based diffusion model unifies video-to-audio, text-to-audio, and joint video-text-to-audio generation, supported by a new 470k-pair dataset and three-stage progressive training that resolves task competition.
AudioMoG: Guiding Audio Generation with Mixture-of-Guidance
cs.SD 2025-09 unverdicted novelty 7.0

AudioMoG is a mixture-of-guidance sampling technique that combines CFG and AG signals to outperform single-guidance baselines in text-to-audio generation at equivalent speed.
WavFlow: Audio Generation in Waveform Space
cs.SD 2026-05 conditional novelty 6.0

WavFlow performs direct waveform audio generation via flow matching on 2D token grids from raw patches plus amplitude lifting, matching latent-based methods on VGGSound and AudioCaps without intermediate compression.
Stage-adaptive audio diffusion modeling
cs.SD 2026-05 unverdicted novelty 6.0

A semantic progress signal from SSL discrepancy slope enables three stage-aware mechanisms that improve training efficiency and performance in audio diffusion models over static baselines.
Audio-Omni: Extending Multi-modal Understanding to Versatile Audio Generation and Editing
cs.SD 2026-04 unverdicted novelty 6.0

Audio-Omni unifies audio understanding, generation, and editing in one end-to-end model across domains, backed by a new million-pair AudioEdit dataset, and achieves strong benchmark results.
Language-Guided Multimodal Texture Authoring via Generative Models
cs.HC 2026-04 unverdicted novelty 6.0

A language-driven system generates semantically consistent multimodal textures from text prompts by linking autoregressive haptic models and diffusion-based visuals through a shared latent representation.
FoleyDirector: Fine-Grained Temporal Steering for Video-to-Audio Generation via Structured Scripts
cs.SD 2026-03 unverdicted novelty 6.0

FoleyDirector introduces structured temporal scripts and a fusion module to enable precise timing control in DiT-based video-to-audio generation while preserving audio fidelity.
Language Models as Semantic Teachers: Post-Training Alignment for Medical Audio Understanding
cs.SD 2025-12 unverdicted novelty 6.0

AcuLa aligns audio models with medical language models via contrastive and self-supervised objectives on LLM-generated clinical reports, raising mean AUROC from 0.68 to 0.79 across 18 cardio-respiratory tasks.
Two-Dimensional Quantization for Geometry-Aware Audio Coding
cs.SD 2025-12 unverdicted novelty 6.0

Q2D2 uses 2D geometric grid projections to quantize feature pairs in neural audio codecs, yielding implicit codebooks that improve efficiency and utilization over RVQ, VQ, and FSQ while maintaining reconstruction quality.
DGSNA: Dynamic Generative Scene-based Noise Addition method
cs.SD 2024-11 unverdicted novelty 6.0

DGSNA dynamically generates scene-specific noise via prompt-driven language models and text-to-audio diffusion, then mixes it with speech to improve recognition and keyword spotting robustness by up to 11.32%.
AudioPaLM: A Large Language Model That Can Speak and Listen
cs.CL 2023-06 unverdicted novelty 6.0

AudioPaLM unifies PaLM-2 and AudioLM to outperform prior systems on speech translation while enabling zero-shot speech-to-text for many unseen language pairs and voice transfer from short prompts.
Shap-E: Generating Conditional 3D Implicit Functions
cs.CV 2023-05 accept novelty 6.0

Shap-E encodes 3D assets into implicit function parameters then uses a conditional diffusion model to generate new ones from text, enabling fast multi-representation 3D asset creation.
AudioX-Turbo: A Unified Framework for Efficient Anything-to-Audio Generation
cs.SD 2026-06 unverdicted novelty 5.0

AudioX-Turbo distills a Multimodal Diffusion Transformer into a 4-step student model for efficient multimodal anything-to-audio generation, trained on a new 9.2M-sample dataset IF-caps-Pro.
From Prompts to Context: An Ontology-Driven Framework for Human-Generative AI Collaboration
cs.HC 2026-05 unverdicted novelty 5.0

Presents the CCAI ontology and SPARQL retrieval method to convert ephemeral Human-Generative AI prompt interactions into explicit, machine-readable collaboration traces, illustrated in a competency-profile software ca...
Woosh: A Sound Effects Foundation Model
cs.SD 2026-04 accept novelty 5.0

Woosh is a new publicly released foundation model optimized for high-quality sound effect generation from text or video, showing competitive or better results than open alternatives like Stable Audio Open.
Movie Gen: A Cast of Media Foundation Models
cs.CV 2024-10 unverdicted novelty 5.0

A 30B-parameter transformer and related models generate high-quality videos and audio, claiming state-of-the-art results on text-to-video, video editing, personalization, and audio generation tasks.
STAR-VAE: Structured Topology-Aware Regularization for Audio Reconstruction and Generation
eess.AS 2026-06 unverdicted novelty 4.0

STAR-VAE introduces topology-aware regularization to reshape VAE latent geometry for audio, claiming to resolve the Rate-Distortion-Regularity Trilemma and achieve SOTA reconstruction.
Quantifying the Climate Risk of Generative AI: Region-Aware Carbon Accounting with G-TRACE and the AI Sustainability Pyramid
cs.CY 2025-11 unverdicted novelty 4.0

G-TRACE quantifies region-aware GenAI emissions and estimates 4,309 MWh energy use plus 2,068 tCO2 from the Ghibli-style image generation trend, paired with the AI Sustainability Pyramid for translating metrics into policy.
Quantifying the Climate Risk of Generative AI: Region-Aware Carbon Accounting with G-TRACE and the AI Sustainability Pyramid
cs.CY 2025-11 unverdicted novelty 4.0

G-TRACE provides region-aware estimates of GenAI carbon emissions including 4309 MWh and 2068 tCO2 for a 2024-2025 image generation trend, paired with a seven-level AI Sustainability Pyramid for policy guidance.