AudioLM: a Language Modeling Approach to Audio Generation

Damien Vincent; David Grangier; Dominik Roblek; Eugene Kharitonov; Marco Tagliasacchi; Matt Sharifi; Neil Zeghidour; Olivier Pietquin; Olivier Teboul; Raphaël Marinier

arxiv: 2209.03143 · v2 · pith:TSHGVIVBnew · submitted 2022-09-07 · cs.SD · cs.LG · eess.AS

Zalán Borsos, Raphaël Marinier, Damien Vincent, Eugene Kharitonov, Olivier Pietquin, Matt Sharifi, Dominik Roblek, Olivier Teboul, David Grangier, Marco Tagliasacchi, Neil Zeghidour

classification cs.SD cs.LG eess.AS

keywords audio, audiolm, continuations, generation, language, long-term, speech, achieve

We introduce AudioLM, a framework for high-quality audio generation with long-term consistency. AudioLM maps the input audio to a sequence of discrete tokens and casts audio generation as a language modeling task in this representation space. We show how existing audio tokenizers provide different trade-offs between reconstruction quality and long-term structure, and we propose a hybrid tokenization scheme to achieve both objectives. Namely, we leverage the discretized activations of a masked language model pre-trained on audio to capture long-term structure and the discrete codes produced by a neural audio codec to achieve high-quality synthesis. By training on large corpora of raw audio waveforms, AudioLM learns to generate natural and coherent continuations given short prompts. When trained on speech, and without any transcript or annotation, AudioLM generates syntactically and semantically plausible speech continuations while also maintaining speaker identity and prosody for unseen speakers. Furthermore, we demonstrate how our approach extends beyond speech by generating coherent piano music continuations, despite being trained without any symbolic representation of music.
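The hybrid tokenization idea in the abstract can be illustrated with a toy sketch. It is not the authors' implementation: the 2-D frames, the hand-written codebooks, the helper names (`nearest_code`, `tokenize`, `hybrid_sequence`), and the choice to place semantic tokens before acoustic tokens in one shared vocabulary are all assumptions made for illustration. In the real system the two token streams come from separate learned models rather than from nearest-centroid lookup on the same frames.

```python
def nearest_code(frame, codebook):
    """Return the index of the codebook vector closest to `frame` (squared L2)."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(codebook)), key=lambda i: dist(frame, codebook[i]))


def tokenize(frames, codebook):
    """Map each continuous frame to one discrete token id."""
    return [nearest_code(f, codebook) for f in frames]


def hybrid_sequence(semantic_tokens, acoustic_tokens, semantic_vocab_size):
    """Join both streams into one token sequence over a shared vocabulary.

    Acoustic ids are shifted by `semantic_vocab_size` so the streams do not collide.
    The semantic-first ordering is an assumption of this sketch.
    """
    return list(semantic_tokens) + [t + semantic_vocab_size for t in acoustic_tokens]


# Hypothetical toy data: a coarse codebook stands in for the "semantic" tokenizer,
# a finer one for the "acoustic" (codec) tokenizer.
semantic_codebook = [(0.0, 0.0), (1.0, 1.0)]
acoustic_codebook = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0)]
frames = [(0.1, 0.2), (0.9, 0.8), (0.2, 0.9)]

semantic = tokenize(frames, semantic_codebook)   # coarse structure
acoustic = tokenize(frames, acoustic_codebook)   # finer detail
sequence = hybrid_sequence(semantic, acoustic, len(semantic_codebook))
print(sequence)
```

Once audio is a flat sequence of integer ids like `sequence`, generating a continuation reduces to next-token prediction over that sequence, which is the language modeling framing the abstract describes.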

Forward citations

Cited by 21 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

MusicLM: Generating Music From Text
cs.SD 2023-01 conditional novelty 8.0

MusicLM produces coherent multi-minute 24 kHz music from text prompts using hierarchical sequence-to-sequence modeling and outperforms prior systems in quality and text adherence.
Codec-Robust Attacks on Audio LLMs
cs.SD 2026-05 unverdicted novelty 7.0

CodecAttack perturbs audio in codec latent space with multi-bitrate EoT to achieve 85.5% average ASR on Opus-compressed Audio LLMs versus under 26% for waveform baselines, with transfer to MP3 and AAC.
Nautilus: An Auto-Scheduling Tensor Compiler for Efficient Tiled GPU Kernels
cs.PL 2026-04 unverdicted novelty 7.0

Nautilus auto-compiles math-like tensor descriptions into optimized GPU kernels, delivering up to 42% higher throughput than prior compilers on transformer models across NVIDIA GPUs.
TSVer: A Benchmark for Fact Verification Against Time-Series Evidence
cs.CL 2025-11 unverdicted novelty 7.0

TSVer is a new benchmark dataset for fact verification against time-series evidence, with 304 annotated real-world claims, 400 time series, verdicts, and justifications, plus baseline results showing current models struggle.
Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers
cs.CL 2023-01 unverdicted novelty 7.0

VALL-E is a neural codec language model trained on 60K hours of speech that performs zero-shot TTS, synthesizing natural speech that matches an unseen speaker's voice, emotion, and environment from a 3-second prompt.
Hidden in Plain Tokens: Simply Robust, Gradient-Free Watermark for Synthetic Audio
cs.LG 2026-05 unverdicted novelty 6.0

A training-free audio watermarking method that reduces vocabulary via community detection to boost detection robustness by orders of magnitude while resisting audio modifications.
Codec-Robust Attacks on Audio LLMs
cs.SD 2026-05 unverdicted novelty 6.0

CodecAttack optimizes perturbations in neural audio codec latent space to reach 85.5% average target-substring ASR on compressed Opus audio while waveform baselines stay below 26%.
SpidR-Adapt: A Universal Speech Representation Model for Few-Shot Adaptation
cs.CL 2025-12 unverdicted novelty 6.0

SpidR-Adapt uses meta-learning with a first-order bi-level optimization heuristic to adapt speech representations to new languages with less than 1 hour of data, achieving 100x better efficiency than standard training.
CodecSep: Prompt-Driven Universal Sound Separation on Neural Audio Codec Latents
cs.SD 2025-09 unverdicted novelty 6.0

CodecSep performs prompt-driven universal sound separation directly in neural audio codec latents by combining a frozen DAC backbone with a lightweight FiLM-conditioned Transformer masker driven by CLAP embeddings, yi...
AudioPaLM: A Large Language Model That Can Speak and Listen
cs.CL 2023-06 unverdicted novelty 6.0

AudioPaLM unifies PaLM-2 and AudioLM to outperform prior systems on speech translation while enabling zero-shot speech-to-text for many unseen language pairs and voice transfer from short prompts.
Shap-E: Generating Conditional 3D Implicit Functions
cs.CV 2023-05 accept novelty 6.0

Shap-E encodes 3D assets into implicit function parameters then uses a conditional diffusion model to generate new ones from text, enabling fast multi-representation 3D asset creation.
F3-Tokenizer: Taming Audio Autoencoder Latents for Understanding and Generation
cs.SD 2026-06 unverdicted novelty 5.0

F3-Tokenizer adapts audio autoencoder latents with noise-regularized bottleneck (channel normalization and stochastic perturbation) and a representation encoder (RQ-MTP plus frozen-LLM supervision) to support both hig...
How Optimality Structures Sparse Dictionaries: A Theory for Understanding SAE Representations
q-bio.NC 2026-06 unverdicted novelty 5.0

Derives optimality constraints for nonnegative joint dictionary learning that explain observed SAE behaviors such as feature splitting, absorption, and dense antipodal features.
Drum Synthesis from Expressive Drum Grids via Neural Audio Codecs
cs.SD 2026-05 unverdicted novelty 5.0

A Transformer predicts tokens from neural audio codecs (EnCodec, DAC, X-Codec) to convert expressive drum grids into audio, trained and evaluated on the E-GMD dataset using objective metrics.
HAFM: Hierarchical Autoregressive Foundation Model for Music Accompaniment Generation
cs.SD 2026-04 unverdicted novelty 5.0

HAFM uses a hierarchical autoregressive model with dual-rate HuBERT and EnCodec tokens to generate coherent instrumental music from vocals, achieving FAD 2.08 on MUSDB18 while matching prior systems with fewer parameters.
Expectation and Acoustic Neural Network Representations Enhance Music Identification from Brain Activity
cs.AI 2026-03 unverdicted novelty 5.0

Separating acoustic and expectation ANN representations as teacher targets improves EEG music identification beyond baselines and seed ensembles.
A Simple Method to Enhance Pre-trained Language Models with Speech Tokens for Classification
cs.CL 2025-12 unverdicted novelty 5.0

Lasso-selected speech tokens enhance text LLMs for multimodal classification by reducing long audio sequences to task-relevant features via self-supervised adaptation.
Continuous diffusion for categorical data
cs.CL 2022-11 unverdicted novelty 5.0

The paper proposes CDCD, a continuous-time and continuous-space diffusion framework for categorical data, and reports results on language modeling tasks.
Musical Attention Transformer: Music Generation Using a Music-Specific Attention Model
cs.SD 2026-05 unverdicted novelty 4.0

The paper introduces Musical Attention, an attention variant that incorporates eight musical features including metadata to generate more coherent and varied music than standard or strided attention baselines.
ModelScope Text-to-Video Technical Report
cs.CV 2023-08 unverdicted novelty 4.0

ModelScopeT2V is a 1.7-billion-parameter text-to-video model built on Stable Diffusion that adds temporal modeling and outperforms prior methods on three evaluation metrics.
Telephony Voice Agent for Banking Services
cs.HC 2026-06 unverdicted novelty 2.0

Implementation of a telephony voice agent for banking services using Dialogflow CX supporting queries, authentication, and live agent handoff.