hub Canonical reference

High Fidelity Neural Audio Compression

· 2022 · eess.AS · arXiv 2210.13438

Canonical reference. 70% of citing Pith papers cite this work as background.

61 Pith papers citing it

Background 70% of classified citations

open full Pith review browse 61 citing papers arXiv PDF

abstract

We introduce a state-of-the-art real-time, high-fidelity, audio codec leveraging neural networks. It consists in a streaming encoder-decoder architecture with quantized latent space trained in an end-to-end fashion. We simplify and speed-up the training by using a single multiscale spectrogram adversary that efficiently reduces artifacts and produce high-quality samples. We introduce a novel loss balancer mechanism to stabilize training: the weight of a loss now defines the fraction of the overall gradient it should represent, thus decoupling the choice of this hyper-parameter from the typical scale of the loss. Finally, we study how lightweight Transformer models can be used to further compress the obtained representation by up to 40%, while staying faster than real time. We provide a detailed description of the key design choices of the proposed model including: training objective, architectural changes and a study of various perceptual loss functions. We present an extensive subjective evaluation (MUSHRA tests) together with an ablation study for a range of bandwidths and audio domains, including speech, noisy-reverberant speech, and music. Our approach is superior to the baselines methods across all evaluated settings, considering both 24 kHz monophonic and 48 kHz stereophonic audio. Code and models are available at github.com/facebookresearch/encodec.

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 7 method 2 other 1

citation-polarity summary

background 7 use method 2 unclear 1

representative citing papers

MusicLM: Generating Music From Text

cs.SD · 2023-01-26 · conditional · novelty 8.0

MusicLM produces coherent multi-minute 24 kHz music from text prompts using hierarchical sequence-to-sequence modeling and outperforms prior systems in quality and text adherence.

FlexiSLM: A Dynamic and Controllable Frame Rate Spoken Language Model

cs.SD · 2026-06-30 · unverdicted · novelty 7.0

FlexiSLM is the first spoken language model supporting dynamic and controllable frame rates on speech input and output, outperforming fixed-rate 7B models at high quality and enabling faster inference at lower rates like 6.25 Hz.

DTM-Codec: Dynamic Token Masking for VFR Speech Coding with Efficient Boundary Selection

eess.AS · 2026-06-28 · unverdicted · novelty 7.0

DTM-Codec achieves better reconstruction quality and intelligibility than fixed-frame-rate neural speech codecs at matched total bitrate via dynamic token masking and Path Length Equalization for variable frame rates.

How Neural Losses Shape VAE Latents

cs.LG · 2026-05-30 · unverdicted · novelty 7.0

Neural reconstruction losses in VAEs reduce latent information content and produce more isotropic latent geometries with even uncertainty distribution.

Prompt Codebooks: Discrete Compositional Optimization for Language Model Instruction Refinement

cs.AI · 2026-05-27 · unverdicted · novelty 7.0

Prompt Codebooks recasts automatic prompt optimization as discrete learning over a finite vocabulary of atomic natural-language instincts with per-instance routing, yielding up to +30.36 point gains over zero-shot and shorter prompts on six benchmarks.

Academic Text-to-Music Grand Challenge: Datasets, Baselines, and Evaluation Methods

cs.SD · 2026-05-20 · unverdicted · novelty 7.0 · 2 refs

Presents the ATTM grand challenge with efficiency and performance tracks for text-to-music generation using a public instrumental music dataset, evaluated via FAD, CLAP, a new CCS metric, and subjective tests.

Codec-Robust Attacks on Audio LLMs

cs.SD · 2026-05-19 · unverdicted · novelty 7.0 · 2 refs

CodecAttack perturbs audio in codec latent space with multi-bitrate EoT to achieve 85.5% average ASR on Opus-compressed Audio LLMs versus under 26% for waveform baselines, with transfer to MP3 and AAC.

Modeling Music as a Time-Frequency Image: A 2D Tokenizer for Music Generation

cs.SD · 2026-05-15 · unverdicted · novelty 7.0

BandTok tokenizes Mel-spectrograms as independent time-frequency band tokens from a single codebook and pairs it with 2D RoPE in an autoregressive model to improve music generation over residual multi-codebook tokenizers.

AffectCodec: Emotion-Preserving Neural Speech Codec for Expressive Speech Modeling

cs.SD · 2026-05-11 · unverdicted · novelty 7.0

AffectCodec is an emotion-guided neural speech codec that preserves emotional cues during quantization while maintaining semantic fidelity and prosodic naturalness.

VITA-QinYu: Expressive Spoken Language Model for Role-Playing and Singing

cs.CL · 2026-05-07 · unverdicted · novelty 7.0

VITA-QinYu is the first expressive end-to-end spoken language model supporting role-playing and singing alongside conversation, trained on 15.8K hours of data and outperforming prior models on expressiveness and conversational benchmarks.

PairAlign: A Framework for Sequence Tokenization via Self-Alignment with Applications to Audio Tokenization

cs.LG · 2026-05-07 · unverdicted · novelty 7.0 · 2 refs

PairAlign learns compact variable-length token sequences for audio via self-alignment on paired content-preserving views, achieving 55% fewer archive tokens than VQ while preserving edit-distance retrieval at 12.71 tokens/s.

Indic-CodecFake meets SATYAM: Towards Detecting Neural Audio Codec Synthesized Speech Deepfakes in Indic Languages

eess.AS · 2026-04-21 · unverdicted · novelty 7.0

Introduces the Indic-CodecFake dataset for Indic codec deepfakes and SATYAM, a novel hyperbolic ALM that outperforms baselines through dual-stage semantic-prosodic fusion using Bhattacharya distance.

Steering Autoregressive Music Generation with Recursive Feature Machines

cs.LG · 2025-10-21 · unverdicted · novelty 7.0

MusicRFM discovers interpretable concept directions in music model hidden states using RFM probes and injects them at inference to steer generation toward desired musical properties without retraining.

Language Model Beats Diffusion -- Tokenizer is Key to Visual Generation

cs.CV · 2023-10-09 · unverdicted · novelty 7.0

A new shared video-image tokenizer enables large language models to surpass diffusion models on standard visual generation benchmarks.

Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers

cs.CL · 2023-01-05 · unverdicted · novelty 7.0

VALL-E is a neural codec language model trained on 60K hours of speech that performs zero-shot TTS, synthesizing natural speech that matches an unseen speaker's voice, emotion, and environment from a 3-second prompt.

AnchorSteer: Self-Discovered Concept Injection for Structure-Preserving Music Editing

cs.SD · 2026-05-29 · unverdicted · novelty 6.0

AnchorSteer couples self-discovered semantic concept vectors with structural anchoring in diffusion models to achieve controllable music editing with preserved structure.

Audio Pirates: Black-box Audio Watermark Removal via Diffusion Priors

cs.CR · 2026-05-28 · unverdicted · novelty 6.0

DiffErase removes black-box audio watermarks via diffusion priors by adding intermediate noise and regenerating with a pretrained model, preserving quality across audio domains.

Taming Audio VAEs via Target-KL Regularization

cs.SD · 2026-05-16 · unverdicted · novelty 6.0

The paper introduces target-KL regularization to train audio VAEs at specific bitrates, enabling rate-distortion curves and comparison to discrete audio codecs for improved text-to-sound generation.

A General Differentiable Ray-Wave Framework for Hybrid Refractive-Diffractive System Modeling and Optimization

physics.optics · 2026-05-14 · unverdicted · novelty 6.0

A plug-and-play differentiable model bridging ray and wave optics for hybrid systems that enables end-to-end optimization of planar and conformal diffractive elements.

Seconds-Aligned PCA-DAC Latent Diffusion for Symbolic-to-Audio Drum Rendering

cs.SD · 2026-05-13 · unverdicted · novelty 6.0

Sec2Drum-DAC renders drum audio from symbolic inputs via diffusion on PCA-reduced DAC latents, improving spectral and transient metrics over regression baselines on 1733 held-out windows.

Compact Latent Manifold Translation: A Parameter-Efficient Foundation Model for Cross-Modal and Cross-Frequency Physiological Signal Synthesis

eess.SP · 2026-05-13 · unverdicted · novelty 6.0

A compact 0.09B model using hierarchical discrete tokenization and prompted latent translation outperforms larger baselines in cross-modal PPG-to-ECG synthesis and cross-frequency super-resolution.

MiniMind-O Technical Report: An Open Small-Scale Speech-Native Omni Model

cs.SD · 2026-05-05 · accept · novelty 6.0

MiniMind-O delivers a working 0.1B-scale open omni model with speech-native output, Thinker-Talker split, frozen encoders, and full release of code, checkpoints, and training data.

Text-To-Speech with Chain-of-Details: modeling temporal dynamics in speech generation

eess.AS · 2026-04-21 · unverdicted · novelty 6.0

Chain-of-Details (CoD) is a cascaded TTS method that explicitly models temporal coarse-to-fine dynamics with a shared decoder, achieving competitive performance using significantly fewer parameters.

LLM-Codec: Neural Audio Codec Meets Language Model Objectives

cs.SD · 2026-04-20 · unverdicted · novelty 6.0

LLM-Codec augments audio codec training with multi-step token prediction and contrastive semantic alignment to improve both waveform reconstruction and autoregressive predictability for speech language models.

citing papers explorer

Showing 16 of 16 citing papers after filters.

FlexiSLM: A Dynamic and Controllable Frame Rate Spoken Language Model cs.SD · 2026-06-30 · unverdicted · none · ref 121 · internal anchor
FlexiSLM is the first spoken language model supporting dynamic and controllable frame rates on speech input and output, outperforming fixed-rate 7B models at high quality and enabling faster inference at lower rates like 6.25 Hz.
Academic Text-to-Music Grand Challenge: Datasets, Baselines, and Evaluation Methods cs.SD · 2026-05-20 · unverdicted · none · ref 7 · 2 links · internal anchor
Presents the ATTM grand challenge with efficiency and performance tracks for text-to-music generation using a public instrumental music dataset, evaluated via FAD, CLAP, a new CCS metric, and subjective tests.
Codec-Robust Attacks on Audio LLMs cs.SD · 2026-05-19 · unverdicted · none · ref 16 · 2 links · internal anchor
CodecAttack perturbs audio in codec latent space with multi-bitrate EoT to achieve 85.5% average ASR on Opus-compressed Audio LLMs versus under 26% for waveform baselines, with transfer to MP3 and AAC.
Modeling Music as a Time-Frequency Image: A 2D Tokenizer for Music Generation cs.SD · 2026-05-15 · unverdicted · none · ref 2 · internal anchor
BandTok tokenizes Mel-spectrograms as independent time-frequency band tokens from a single codebook and pairs it with 2D RoPE in an autoregressive model to improve music generation over residual multi-codebook tokenizers.
AffectCodec: Emotion-Preserving Neural Speech Codec for Expressive Speech Modeling cs.SD · 2026-05-11 · unverdicted · none · ref 12 · internal anchor
AffectCodec is an emotion-guided neural speech codec that preserves emotional cues during quantization while maintaining semantic fidelity and prosodic naturalness.
AnchorSteer: Self-Discovered Concept Injection for Structure-Preserving Music Editing cs.SD · 2026-05-29 · unverdicted · none · ref 2 · internal anchor
AnchorSteer couples self-discovered semantic concept vectors with structural anchoring in diffusion models to achieve controllable music editing with preserved structure.
Taming Audio VAEs via Target-KL Regularization cs.SD · 2026-05-16 · unverdicted · none · ref 32 · internal anchor
The paper introduces target-KL regularization to train audio VAEs at specific bitrates, enabling rate-distortion curves and comparison to discrete audio codecs for improved text-to-sound generation.
Seconds-Aligned PCA-DAC Latent Diffusion for Symbolic-to-Audio Drum Rendering cs.SD · 2026-05-13 · unverdicted · none · ref 8 · internal anchor
Sec2Drum-DAC renders drum audio from symbolic inputs via diffusion on PCA-reduced DAC latents, improving spectral and transient metrics over regression baselines on 1733 held-out windows.
LLM-Codec: Neural Audio Codec Meets Language Model Objectives cs.SD · 2026-04-20 · unverdicted · none · ref 1 · internal anchor
LLM-Codec augments audio codec training with multi-step token prediction and contrastive semantic alignment to improve both waveform reconstruction and autoregressive predictability for speech language models.
Qwen3-TTS Technical Report cs.SD · 2026-01-22 · unverdicted · none · ref 3 · internal anchor
Qwen3-TTS delivers state-of-the-art multilingual TTS performance with 3-second voice cloning, description control, and ultra-low-latency streaming via dual tokenizers and a dual-track LM architecture trained on over 5 million hours of data.
Two-Dimensional Quantization for Geometry-Aware Audio Coding cs.SD · 2025-12-01 · unverdicted · none · ref 19 · internal anchor
Q2D2 uses 2D geometric grid projections to quantize feature pairs in neural audio codecs, yielding implicit codebooks that improve efficiency and utilization over RVQ, VQ, and FSQ while maintaining reconstruction quality.
Meta Audiobox Aesthetics: Unified Automatic Quality Assessment for Speech, Music, and Sound cs.SD · 2025-02-07 · unverdicted · none · ref 60 · internal anchor
Unified no-reference models assess audio aesthetics across speech, music, and sound via four perceptual axes and achieve performance comparable or superior to human mean opinion scores.
Drum Synthesis from Expressive Drum Grids via Neural Audio Codecs cs.SD · 2026-05-11 · unverdicted · none · ref 4 · internal anchor
A Transformer predicts tokens from neural audio codecs (EnCodec, DAC, X-Codec) to convert expressive drum grids into audio, trained and evaluated on the E-GMD dataset using objective metrics.
Diffusion Reconstruction towards Generalizable Audio Deepfake Detection cs.SD · 2026-04-29 · unverdicted · none · ref 21 · internal anchor
Diffusion reconstruction creates hard samples for audio deepfake detection training, and when paired with feature aggregation and RACL, it reduces average EER versus baselines.
XAttnMark: Learning Robust Audio Watermarking with Cross-Attention cs.SD · 2025-02-06 · unverdicted · none · ref 19 · internal anchor
XAttnMark is a new neural audio watermarking method using partial parameter sharing, cross-attention for message retrieval, temporal conditioning, and a psychoacoustic TF masking loss that reports state-of-the-art detection and attribution robustness.
Expressive Prompting: Improving Emotion Intensity and Speaker Consistency in Zero-Shot TTS cs.SD · 2024-09-27 · unverdicted · none · ref 6 · internal anchor
A two-stage static-then-dynamic prompt selection strategy using prosodic features, LLM coherence scores, and similarity metrics improves emotion intensity and speaker consistency in zero-shot TTS.

High Fidelity Neural Audio Compression

hub tools

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer