MusicLM produces coherent multi-minute 24 kHz music from text prompts using hierarchical sequence-to-sequence modeling and outperforms prior systems in quality and text adherence.
hub Mixed citations
High Fidelity Neural Audio Compression
Mixed citation behavior. Most common role is background (67%).
abstract
We introduce a state-of-the-art real-time, high-fidelity, audio codec leveraging neural networks. It consists in a streaming encoder-decoder architecture with quantized latent space trained in an end-to-end fashion. We simplify and speed-up the training by using a single multiscale spectrogram adversary that efficiently reduces artifacts and produce high-quality samples. We introduce a novel loss balancer mechanism to stabilize training: the weight of a loss now defines the fraction of the overall gradient it should represent, thus decoupling the choice of this hyper-parameter from the typical scale of the loss. Finally, we study how lightweight Transformer models can be used to further compress the obtained representation by up to 40%, while staying faster than real time. We provide a detailed description of the key design choices of the proposed model including: training objective, architectural changes and a study of various perceptual loss functions. We present an extensive subjective evaluation (MUSHRA tests) together with an ablation study for a range of bandwidths and audio domains, including speech, noisy-reverberant speech, and music. Our approach is superior to the baselines methods across all evaluated settings, considering both 24 kHz monophonic and 48 kHz stereophonic audio. Code and models are available at github.com/facebookresearch/encodec.
hub tools
citation-role summary
citation-polarity summary
representative citing papers
Neural reconstruction losses in VAEs reduce latent information content and produce more isotropic latent geometries with even uncertainty distribution.
CodecAttack perturbs audio in codec latent space with multi-bitrate EoT to achieve 85.5% average ASR on Opus-compressed Audio LLMs versus under 26% for waveform baselines, with transfer to MP3 and AAC.
BandTok tokenizes Mel-spectrograms as independent time-frequency band tokens from a single codebook and pairs it with 2D RoPE in an autoregressive model to improve music generation over residual multi-codebook tokenizers.
AffectCodec is an emotion-guided neural speech codec that preserves emotional cues during quantization while maintaining semantic fidelity and prosodic naturalness.
VITA-QinYu is the first expressive end-to-end spoken language model supporting role-playing and singing alongside conversation, trained on 15.8K hours of data and outperforming prior models on expressiveness and conversational benchmarks.
Introduces the Indic-CodecFake dataset for Indic codec deepfakes and SATYAM, a novel hyperbolic ALM that outperforms baselines through dual-stage semantic-prosodic fusion using Bhattacharya distance.
LRLspoof corpus and threshold-transfer evaluation demonstrate that spoof detection performance varies markedly across languages, identifying language as an independent domain shift factor.
MusicRFM discovers interpretable concept directions in music model hidden states using RFM probes and injects them at inference to steer generation toward desired musical properties without retraining.
A new shared video-image tokenizer enables large language models to surpass diffusion models on standard visual generation benchmarks.
VALL-E is a neural codec language model trained on 60K hours of speech that performs zero-shot TTS, synthesizing natural speech that matches an unseen speaker's voice, emotion, and environment from a 3-second prompt.
DiffErase removes black-box audio watermarks via diffusion priors by adding intermediate noise and regenerating with a pretrained model, preserving quality across audio domains.
The paper introduces target-KL regularization to train audio VAEs at specific bitrates, enabling rate-distortion curves and comparison to discrete audio codecs for improved text-to-sound generation.
A plug-and-play differentiable model bridging ray and wave optics for hybrid systems that enables end-to-end optimization of planar and conformal diffractive elements.
Sec2Drum-DAC renders drum audio from symbolic inputs via diffusion on PCA-reduced DAC latents, improving spectral and transient metrics over regression baselines on 1733 held-out windows.
A compact 0.09B model using hierarchical discrete tokenization and prompted latent translation outperforms larger baselines in cross-modal PPG-to-ECG synthesis and cross-frequency super-resolution.
MiniMind-O delivers a working 0.1B-scale open omni model with speech-native output, Thinker-Talker split, frozen encoders, and full release of code, checkpoints, and training data.
Chain-of-Details (CoD) is a cascaded TTS method that explicitly models temporal coarse-to-fine dynamics with a shared decoder, achieving competitive performance using significantly fewer parameters.
LLM-Codec augments audio codec training with multi-step token prediction and contrastive semantic alignment to improve both waveform reconstruction and autoregressive predictability for speech language models.
HCFD is a new pathology-aware benchmark and dataset for codec-fake audio detection in healthcare, with PHOENIX-Mamba achieving up to 97% accuracy by modeling fakes as modes in hyperbolic space.
CSLM achieves cross-modal and cross-lingual alignment in speech LLMs via continual pre-training on discrete tokens and speech-text interleaved instruction tuning, enabling scalability without massive speech datasets.
A Conformer-conditioned decoder-only language model generates discrete tokens via a neural audio codec to separate four music stems, reaching near state-of-the-art perceptual quality and top NISQA on vocals in MUSDB18-HQ tests.
Diff-VS is an efficient audio-aware diffusion U-Net for vocal separation that matches discriminative baselines on objective metrics while achieving state-of-the-art perceptual quality via proxy measures.
Qwen3-TTS delivers state-of-the-art multilingual TTS performance with 3-second voice cloning, description control, and ultra-low-latency streaming via dual tokenizers and a dual-track LM architecture trained on over 5 million hours of data.
citing papers explorer
-
How Neural Losses Shape VAE Latents
Neural reconstruction losses in VAEs reduce latent information content and produce more isotropic latent geometries with even uncertainty distribution.
-
Codec-Robust Attacks on Audio LLMs
CodecAttack perturbs audio in codec latent space with multi-bitrate EoT to achieve 85.5% average ASR on Opus-compressed Audio LLMs versus under 26% for waveform baselines, with transfer to MP3 and AAC.
-
Modeling Music as a Time-Frequency Image: A 2D Tokenizer for Music Generation
BandTok tokenizes Mel-spectrograms as independent time-frequency band tokens from a single codebook and pairs it with 2D RoPE in an autoregressive model to improve music generation over residual multi-codebook tokenizers.
-
AffectCodec: Emotion-Preserving Neural Speech Codec for Expressive Speech Modeling
AffectCodec is an emotion-guided neural speech codec that preserves emotional cues during quantization while maintaining semantic fidelity and prosodic naturalness.
-
VITA-QinYu: Expressive Spoken Language Model for Role-Playing and Singing
VITA-QinYu is the first expressive end-to-end spoken language model supporting role-playing and singing alongside conversation, trained on 15.8K hours of data and outperforming prior models on expressiveness and conversational benchmarks.
-
Indic-CodecFake meets SATYAM: Towards Detecting Neural Audio Codec Synthesized Speech Deepfakes in Indic Languages
Introduces the Indic-CodecFake dataset for Indic codec deepfakes and SATYAM, a novel hyperbolic ALM that outperforms baselines through dual-stage semantic-prosodic fusion using Bhattacharya distance.
-
When Spoof Detectors Travel: Evaluation Across 66 Languages in the Low-Resource Language Spoofing Corpus
LRLspoof corpus and threshold-transfer evaluation demonstrate that spoof detection performance varies markedly across languages, identifying language as an independent domain shift factor.
-
Steering Autoregressive Music Generation with Recursive Feature Machines
MusicRFM discovers interpretable concept directions in music model hidden states using RFM probes and injects them at inference to steer generation toward desired musical properties without retraining.
-
Language Model Beats Diffusion -- Tokenizer is Key to Visual Generation
A new shared video-image tokenizer enables large language models to surpass diffusion models on standard visual generation benchmarks.
-
Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers
VALL-E is a neural codec language model trained on 60K hours of speech that performs zero-shot TTS, synthesizing natural speech that matches an unseen speaker's voice, emotion, and environment from a 3-second prompt.
-
Audio Pirates: Black-box Audio Watermark Removal via Diffusion Priors
DiffErase removes black-box audio watermarks via diffusion priors by adding intermediate noise and regenerating with a pretrained model, preserving quality across audio domains.
-
Taming Audio VAEs via Target-KL Regularization
The paper introduces target-KL regularization to train audio VAEs at specific bitrates, enabling rate-distortion curves and comparison to discrete audio codecs for improved text-to-sound generation.
-
A General Differentiable Ray-Wave Framework for Hybrid Refractive-Diffractive System Modeling and Optimization
A plug-and-play differentiable model bridging ray and wave optics for hybrid systems that enables end-to-end optimization of planar and conformal diffractive elements.
-
Seconds-Aligned PCA-DAC Latent Diffusion for Symbolic-to-Audio Drum Rendering
Sec2Drum-DAC renders drum audio from symbolic inputs via diffusion on PCA-reduced DAC latents, improving spectral and transient metrics over regression baselines on 1733 held-out windows.
-
Compact Latent Manifold Translation: A Parameter-Efficient Foundation Model for Cross-Modal and Cross-Frequency Physiological Signal Synthesis
A compact 0.09B model using hierarchical discrete tokenization and prompted latent translation outperforms larger baselines in cross-modal PPG-to-ECG synthesis and cross-frequency super-resolution.
-
Text-To-Speech with Chain-of-Details: modeling temporal dynamics in speech generation
Chain-of-Details (CoD) is a cascaded TTS method that explicitly models temporal coarse-to-fine dynamics with a shared decoder, achieving competitive performance using significantly fewer parameters.
-
LLM-Codec: Neural Audio Codec Meets Language Model Objectives
LLM-Codec augments audio codec training with multi-step token prediction and contrastive semantic alignment to improve both waveform reconstruction and autoregressive predictability for speech language models.
-
HCFD: A Benchmark for Audio Deepfake Detection in Healthcare
HCFD is a new pathology-aware benchmark and dataset for codec-fake audio detection in healthcare, with PHOENIX-Mamba achieving up to 97% accuracy by modeling fakes as modes in hyperbolic space.
-
Efficient Training for Cross-lingual Speech Language Models
CSLM achieves cross-modal and cross-lingual alignment in speech LLMs via continual pre-training on discrete tokens and speech-text interleaved instruction tuning, enabling scalability without massive speech datasets.
-
Discrete Token Modeling for Multi-Stem Music Source Separation with Language Models
A Conformer-conditioned decoder-only language model generates discrete tokens via a neural audio codec to separate four music stems, reaching near state-of-the-art perceptual quality and top NISQA on vocals in MUSDB18-HQ tests.
-
Diff-VS: Efficient Audio-Aware Diffusion U-Net for Vocals Separation
Diff-VS is an efficient audio-aware diffusion U-Net for vocal separation that matches discriminative baselines on objective metrics while achieving state-of-the-art perceptual quality via proxy measures.
-
Qwen3-TTS Technical Report
Qwen3-TTS delivers state-of-the-art multilingual TTS performance with 3-second voice cloning, description control, and ultra-low-latency streaming via dual tokenizers and a dual-track LM architecture trained on over 5 million hours of data.
-
Two-Dimensional Quantization for Geometry-Aware Audio Coding
Q2D2 uses 2D geometric grid projections to quantize feature pairs in neural audio codecs, yielding implicit codebooks that improve efficiency and utilization over RVQ, VQ, and FSQ while maintaining reconstruction quality.
-
Step-Audio 2 Technical Report
Step-Audio 2 integrates a latent audio encoder, reasoning-centric reinforcement learning, and discrete audio token generation into language modeling to deliver state-of-the-art performance on audio understanding and conversational benchmarks.
-
Meta Audiobox Aesthetics: Unified Automatic Quality Assessment for Speech, Music, and Sound
Unified no-reference models assess audio aesthetics across speech, music, and sound via four perceptual axes and achieve performance comparable or superior to human mean opinion scores.
-
CogACT: A Foundational Vision-Language-Action Model for Synergizing Cognition and Action in Robotic Manipulation
CogACT is a new VLA model that uses a conditioned diffusion action transformer to achieve over 35% higher average success rates than OpenVLA in simulation and 55% in real-robot experiments while generalizing to new robots and objects.
-
Mechanisms of Multimodal Synchronization: Insights from Decoder-Based Video-Text-to-Speech Synthesis
Experiments with a video-text-to-speech transformer show co-temporal positional indexing enables synchronization without timestamps, text and video supply complementary signals, and modality ordering creates a trade-off between in-domain accuracy and cross-domain generalization.
-
Compact 3D Gaussian Splatting For Dense Visual SLAM
A compact 3D Gaussian Splatting SLAM system reduces Gaussian count and parameter size via masking and a geometry codebook while preserving SOTA reconstruction quality and pose accuracy.
-
Qwen-Audio: Advancing Universal Audio Understanding via Unified Large-Scale Audio-Language Models
Qwen-Audio trains a unified model on diverse audio and tasks with hierarchical tags to enable strong zero-shot performance on audio understanding benchmarks and multi-turn audio chat.
-
AudioPaLM: A Large Language Model That Can Speak and Listen
AudioPaLM unifies PaLM-2 and AudioLM to outperform prior systems on speech translation while enabling zero-shot speech-to-text for many unseen language pairs and voice transfer from short prompts.
-
Thinking-while-speaking: A Controlled, Interleaved Reasoning Method for Real-Time Speech Generation
InterRS enables real-time speech generation with interleaved reasoning via a controlled data pipeline, interleaved SFT, and RL using TA-Balance and Linguistic Quality rewards, yielding 13% gains on math and logic benchmarks.
-
Drum Synthesis from Expressive Drum Grids via Neural Audio Codecs
A Transformer predicts tokens from neural audio codecs (EnCodec, DAC, X-Codec) to convert expressive drum grids into audio, trained and evaluated on the E-GMD dataset using objective metrics.
-
Diffusion Reconstruction towards Generalizable Audio Deepfake Detection
Diffusion reconstruction creates hard samples for audio deepfake detection training, and when paired with feature aggregation and RACL, it reduces average EER versus baselines.
-
Adopting State-of-the-Art Pretrained Audio Representations for Music Recommender Systems
Pretrained audio models show large performance gaps between standard MIR tasks and music recommendation in both hot and cold-start settings.
-
Voxtral TTS
Voxtral TTS produces expressive multilingual speech from 3-second reference audio with a hybrid autoregressive-plus-flow-matching architecture and a new VQ-FSQ tokenizer, achieving 68.4% win rate over ElevenLabs in human evaluations.
-
Expectation and Acoustic Neural Network Representations Enhance Music Identification from Brain Activity
Separating acoustic and expectation ANN representations as teacher targets improves EEG music identification beyond baselines and seed ensembles.
-
A Simple Method to Enhance Pre-trained Language Models with Speech Tokens for Classification
Lasso-selected speech tokens enhance text LLMs for multimodal classification by reducing long audio sequences to task-relevant features via self-supervised adaptation.
-
XAttnMark: Learning Robust Audio Watermarking with Cross-Attention
XAttnMark is a new neural audio watermarking method using partial parameter sharing, cross-attention for message retrieval, temporal conditioning, and a psychoacoustic TF masking loss that reports state-of-the-art detection and attribution robustness.
-
Movie Gen: A Cast of Media Foundation Models
A 30B-parameter transformer and related models generate high-quality videos and audio, claiming state-of-the-art results on text-to-video, video editing, personalization, and audio generation tasks.
-
F5-TTS: A Fairytaler that Fakes Fluent and Faithful Speech with Flow Matching
F5-TTS generates natural speech from text via flow matching on DiT with simple text padding, ConvNeXt refinement, and sway sampling, trained on 100K hours multilingual data.
-
Expressive Prompting: Improving Emotion Intensity and Speaker Consistency in Zero-Shot TTS
A two-stage static-then-dynamic prompt selection strategy using prosodic features, LLM coherence scores, and similarity metrics improves emotion intensity and speaker consistency in zero-shot TTS.