AudioLM: a Language Modeling Approach to Audio Generation
We introduce AudioLM, a framework for high-quality audio generation with long-term consistency. AudioLM maps the input audio to a sequence of discrete tokens and casts audio generation as a language modeling task in this representation space. We show how existing audio tokenizers provide different trade-offs between reconstruction quality and long-term structure, and we propose a hybrid tokenization scheme to achieve both objectives. Namely, we leverage the discretized activations of a masked language model pre-trained on audio to capture long-term structure and the discrete codes produced by a neural audio codec to achieve high-quality synthesis. By training on large corpora of raw audio waveforms, AudioLM learns to generate natural and coherent continuations given short prompts. When trained on speech, and without any transcript or annotation, AudioLM generates syntactically and semantically plausible speech continuations while also maintaining speaker identity and prosody for unseen speakers. Furthermore, we demonstrate how our approach extends beyond speech by generating coherent piano music continuations, despite being trained without any symbolic representation of music.
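The core idea in the abstract, mapping audio to discrete tokens and then generating a continuation with a language model over those tokens, can be illustrated with a deliberately tiny sketch. Everything here is an assumption made for illustration: the uniform quantizer stands in for a real audio tokenizer, the bigram counter stands in for a learned sequence model, and the function names (`tokenize`, `fit_bigram`, `continue_tokens`, `detokenize`) are hypothetical. It does not reproduce AudioLM's hybrid tokenization or its models.

```python
import math
from collections import Counter, defaultdict


def tokenize(samples, levels=8):
    """Map samples in [-1, 1] to integer tokens in [0, levels - 1].

    Toy uniform quantizer, used only as a stand-in for an audio tokenizer.
    """
    tokens = []
    for x in samples:
        x = max(-1.0, min(1.0, x))  # clip to the valid range
        tokens.append(min(levels - 1, int((x + 1.0) / 2.0 * levels)))
    return tokens


def fit_bigram(tokens):
    """Count how often each token follows each other token.

    Stand-in for a learned language model over the token sequence.
    """
    counts = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        counts[prev][nxt] += 1
    return counts


def continue_tokens(counts, prompt, n_new):
    """Greedily extend the prompt with the most frequent successor token."""
    out = list(prompt)
    for _ in range(n_new):
        successors = counts.get(out[-1])
        if not successors:  # token never seen with a successor
            break
        out.append(successors.most_common(1)[0][0])
    return out


def detokenize(tokens, levels=8):
    """Map each token back to the centre of its quantization bin."""
    return [(t + 0.5) / levels * 2.0 - 1.0 for t in tokens]


# Toy "corpus": ten periods of a sine wave.
wave = [math.sin(2 * math.pi * i / 16) for i in range(160)]
tokens = tokenize(wave)
model = fit_bigram(tokens)

# Continue a short prompt of 4 tokens by 12 more tokens, then decode.
continuation = continue_tokens(model, tokens[:4], n_new=12)
audio = detokenize(continuation)
```

A bigram model sees only one token of context, so it cannot tell the rising half of the sine from the falling half. That limitation is a small-scale view of why the abstract stresses tokens that capture long-term structure alongside tokens that support high-quality reconstruction.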
Forward citations
Cited by 21 Pith papers
-
MusicLM: Generating Music From Text
MusicLM produces coherent multi-minute 24 kHz music from text prompts using hierarchical sequence-to-sequence modeling and outperforms prior systems in quality and text adherence.
-
Codec-Robust Attacks on Audio LLMs
CodecAttack perturbs audio in codec latent space with multi-bitrate EoT to achieve 85.5% average ASR on Opus-compressed Audio LLMs versus under 26% for waveform baselines, with transfer to MP3 and AAC.
-
Nautilus: An Auto-Scheduling Tensor Compiler for Efficient Tiled GPU Kernels
Nautilus auto-compiles math-like tensor descriptions into optimized GPU kernels, delivering up to 42% higher throughput than prior compilers on transformer models across NVIDIA GPUs.
-
TSVer: A Benchmark for Fact Verification Against Time-Series Evidence
TSVer is a new benchmark dataset for fact verification against time-series evidence, with 304 annotated real-world claims, 400 time series, verdicts, and justifications, plus baseline results showing current models struggle.
-
Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers
VALL-E is a neural codec language model trained on 60K hours of speech that performs zero-shot TTS, synthesizing natural speech that matches an unseen speaker's voice, emotion, and environment from a 3-second prompt.
-
Hidden in Plain Tokens: Simply Robust, Gradient-Free Watermark for Synthetic Audio
A training-free audio watermarking method that reduces vocabulary via community detection to boost detection robustness by orders of magnitude while resisting audio modifications.
-
Codec-Robust Attacks on Audio LLMs
CodecAttack optimizes perturbations in neural audio codec latent space to reach 85.5% average target-substring ASR on compressed Opus audio while waveform baselines stay below 26%.
-
SpidR-Adapt: A Universal Speech Representation Model for Few-Shot Adaptation
SpidR-Adapt uses meta-learning with a first-order bi-level optimization heuristic to adapt speech representations to new languages with less than 1 hour of data, achieving 100x better efficiency than standard training.
-
CodecSep: Prompt-Driven Universal Sound Separation on Neural Audio Codec Latents
CodecSep performs prompt-driven universal sound separation directly in neural audio codec latents by combining a frozen DAC backbone with a lightweight FiLM-conditioned Transformer masker driven by CLAP embeddings, yi...
-
AudioPaLM: A Large Language Model That Can Speak and Listen
AudioPaLM unifies PaLM-2 and AudioLM to outperform prior systems on speech translation while enabling zero-shot speech-to-text for many unseen language pairs and voice transfer from short prompts.
-
Shap-E: Generating Conditional 3D Implicit Functions
Shap-E encodes 3D assets into implicit function parameters then uses a conditional diffusion model to generate new ones from text, enabling fast multi-representation 3D asset creation.
-
F3-Tokenizer: Taming Audio Autoencoder Latents for Understanding and Generation
F3-Tokenizer adapts audio autoencoder latents with a noise-regularized bottleneck (channel normalization and stochastic perturbation) and a representation encoder (RQ-MTP plus frozen-LLM supervision) to support both hig...
-
How Optimality Structures Sparse Dictionaries: A Theory for Understanding SAE Representations
Derives optimality constraints for nonnegative joint dictionary learning that explain observed SAE behaviors such as feature splitting, absorption, and dense antipodal features.
-
Drum Synthesis from Expressive Drum Grids via Neural Audio Codecs
A Transformer predicts tokens from neural audio codecs (EnCodec, DAC, X-Codec) to convert expressive drum grids into audio, trained and evaluated on the E-GMD dataset using objective metrics.
-
HAFM: Hierarchical Autoregressive Foundation Model for Music Accompaniment Generation
HAFM uses a hierarchical autoregressive model with dual-rate HuBERT and EnCodec tokens to generate coherent instrumental music from vocals, achieving FAD 2.08 on MUSDB18 while matching prior systems with fewer parameters.
-
Expectation and Acoustic Neural Network Representations Enhance Music Identification from Brain Activity
Separating acoustic and expectation ANN representations as teacher targets improves EEG music identification beyond baselines and seed ensembles.
-
A Simple Method to Enhance Pre-trained Language Models with Speech Tokens for Classification
Lasso-selected speech tokens enhance text LLMs for multimodal classification by reducing long audio sequences to task-relevant features via self-supervised adaptation.
-
Continuous diffusion for categorical data
The paper proposes CDCD, a continuous-time and continuous-space diffusion framework for categorical data, and reports results on language modeling tasks.
-
Musical Attention Transformer: Music Generation Using a Music-Specific Attention Model
The paper introduces Musical Attention, an attention variant that incorporates eight musical features including metadata to generate more coherent and varied music than standard or strided attention baselines.
-
ModelScope Text-to-Video Technical Report
ModelScopeT2V is a 1.7-billion-parameter text-to-video model built on Stable Diffusion that adds temporal modeling and outperforms prior methods on three evaluation metrics.
-
Telephony Voice Agent for Banking Services
Implementation of a telephony voice agent for banking services using Dialogflow CX supporting queries, authentication, and live agent handoff.