High Fidelity Neural Audio Compression

2022 · eess.AS · arXiv 2210.13438

Canonical reference. 70% of citing Pith papers cite this work as background.

88 Pith papers citing it

abstract

We introduce a state-of-the-art real-time, high-fidelity audio codec leveraging neural networks. It consists of a streaming encoder-decoder architecture with a quantized latent space, trained in an end-to-end fashion. We simplify and speed up training by using a single multiscale spectrogram adversary that efficiently reduces artifacts and produces high-quality samples. We introduce a novel loss balancer mechanism to stabilize training: the weight of a loss now defines the fraction of the overall gradient it should represent, thus decoupling the choice of this hyper-parameter from the typical scale of the loss. Finally, we study how lightweight Transformer models can be used to further compress the obtained representation by up to 40%, while staying faster than real time. We provide a detailed description of the key design choices of the proposed model, including the training objective, architectural changes, and a study of various perceptual loss functions. We present an extensive subjective evaluation (MUSHRA tests) together with an ablation study for a range of bandwidths and audio domains, including speech, noisy-reverberant speech, and music. Our approach is superior to the baseline methods across all evaluated settings, considering both 24 kHz monophonic and 48 kHz stereophonic audio. Code and models are available at github.com/facebookresearch/encodec.
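The loss balancer idea in the abstract can be illustrated with a minimal sketch. It assumes each loss yields a gradient with respect to the model output, rescales every gradient to unit norm, and then weights it by its share of the total weight. The function name `balance_gradients`, its parameters, and the use of the current norm (rather than a running average of past norms) are illustrative assumptions, not the released implementation.

```python
import math


def balance_gradients(grads, weights, total_norm=1.0, eps=1e-12):
    """Combine per-loss gradients so each contributes a fixed share of the norm.

    grads:   dict mapping a loss name to its gradient (list of floats),
             taken with respect to the model output (assumption).
    weights: dict mapping the same names to non-negative weights.
    """
    weight_sum = sum(weights.values())
    size = len(next(iter(grads.values())))
    combined = [0.0] * size
    for name, grad in grads.items():
        # L2 norm of this loss's gradient; dividing by it removes the
        # dependence on the natural scale of the loss.
        norm = math.sqrt(sum(x * x for x in grad))
        # Fraction of the overall gradient budget assigned to this loss.
        scale = total_norm * weights[name] / weight_sum / (norm + eps)
        for i, x in enumerate(grad):
            combined[i] += scale * x
    return combined


# Hypothetical example: the adversarial gradient is about 200x larger in raw
# scale, yet it receives 3/4 of the budget purely because of its weight.
grads = {"recon": [3.0, 4.0], "adv": [0.0, 1000.0]}
weights = {"recon": 1.0, "adv": 3.0}
combined = balance_gradients(grads, weights)
```

Multiplying any single gradient by a constant leaves the combined result essentially unchanged, which is the decoupling of weight from loss scale that the abstract describes.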

citation-role summary

background 7 · method 2 · other 1

citation-polarity summary

background 7 · use method 2 · unclear 1

representative citing papers

WavTTS: Towards High-Quality Zero-Shot TTS via Direct Raw Waveform Modeling

eess.AS · 2026-06-02 · unverdicted · novelty 8.0

WavTTS is the first raw-waveform diffusion TTS model using DiT flow matching and multi-scale mel supervision that approaches SOTA latent zero-shot performance while beating prior end-to-end models.

MusicLM: Generating Music From Text

cs.SD · 2023-01-26 · conditional · novelty 8.0

MusicLM produces coherent multi-minute 24 kHz music from text prompts using hierarchical sequence-to-sequence modeling and outperforms prior systems in quality and text adherence.

HERMES: A Multi-Granularity Labeling Substrate for Pre-training Data Mixtures

cs.LG · 2026-07-02 · unverdicted · novelty 7.0

HERMES provides a reusable hierarchical labeling substrate for pre-training data that reveals granularity-specific effects in data mixing rules during model training.

FlexiSLM: A Dynamic and Controllable Frame Rate Spoken Language Model

cs.SD · 2026-06-30 · unverdicted · novelty 7.0

FlexiSLM is the first spoken language model supporting dynamic and controllable frame rates on speech input and output, outperforming fixed-rate 7B models at high quality and enabling faster inference at lower rates like 6.25 Hz.

DTM-Codec: Dynamic Token Masking for VFR Speech Coding with Efficient Boundary Selection

eess.AS · 2026-06-28 · unverdicted · novelty 7.0

DTM-Codec achieves better reconstruction quality and intelligibility than fixed-frame-rate neural speech codecs at matched total bitrate via dynamic token masking and Path Length Equalization for variable frame rates.

AudioCALM: Continuous Autoregressive Language Modeling for Universal Audio Generation

eess.AS · 2026-06-22 · unverdicted · novelty 7.0

AudioCALM presents a continuous autoregressive framework with flow-matching prediction and A-MoME architecture that unifies speech, sound, and music generation while matching modality-specific state-of-the-art performance.

NAC: Neural Action Codec for Vision-Language-Action Models

cs.RO · 2026-06-19 · unverdicted · novelty 7.0

NAC adapts multi-scale RVQGAN audio codecs with kinematic-specific losses to produce ordered action tokens that yield lower reconstruction error and higher task success than prior tokenizers in VLA models.

A Survey of Full-Duplex Spoken Dialogue Systems: Architectural Hierarchy, Interaction Ontology, and Decision State Machine

eess.AS · 2026-06-17 · accept · novelty 7.0

A survey proposing an L0-L3 architectural hierarchy, T×I×R interaction ontology, and IDLE/LISTEN/SPEAK/WAIT/DUAL decision state machine for full-duplex spoken dialogue systems, documenting a realization gap between architectural potential and observed behavior due to training data limits.

HoliDubber: Holistic Video Dubbing for Complex Acoustic Scenes via Text-Guided Audio Synthesis

eess.AS · 2026-06-08 · unverdicted · novelty 7.0

HoliDubber introduces a patch-based autoregressive diffusion transformer for joint text-guided synthesis of speech and ambient audio in video dubbing, with a new benchmark showing outperformance over prior speech-only methods.

Probing Spatial Structure in Pretrained Audio Representations

cs.SD · 2026-06-04 · unverdicted · novelty 7.0

Introduces SARL benchmark showing pretrained audio encoders encode source-level spatial factors more readily than room-level factors, with patterns shaped by input configuration and training paradigm.

How Neural Losses Shape VAE Latents

cs.LG · 2026-05-30 · unverdicted · novelty 7.0

Neural reconstruction losses in VAEs reduce latent information content and produce more isotropic latent geometries with even uncertainty distribution.

Prompt Codebooks: Discrete Compositional Optimization for Language Model Instruction Refinement

cs.AI · 2026-05-27 · unverdicted · novelty 7.0

Prompt Codebooks recasts automatic prompt optimization as discrete learning over a finite vocabulary of atomic natural-language instincts with per-instance routing, yielding up to +30.36 point gains over zero-shot and shorter prompts on six benchmarks.

Academic Text-to-Music Grand Challenge: Datasets, Baselines, and Evaluation Methods

cs.SD · 2026-05-20 · unverdicted · novelty 7.0 · 2 refs

Presents the ATTM grand challenge with efficiency and performance tracks for text-to-music generation using a public instrumental music dataset, evaluated via FAD, CLAP, a new CCS metric, and subjective tests.

Codec-Robust Attacks on Audio LLMs

cs.SD · 2026-05-19 · unverdicted · novelty 7.0 · 2 refs

CodecAttack perturbs audio in codec latent space with multi-bitrate EoT to achieve 85.5% average ASR on Opus-compressed Audio LLMs versus under 26% for waveform baselines, with transfer to MP3 and AAC.

Modeling Music as a Time-Frequency Image: A 2D Tokenizer for Music Generation

cs.SD · 2026-05-15 · unverdicted · novelty 7.0

BandTok tokenizes Mel-spectrograms as independent time-frequency band tokens from a single codebook and pairs it with 2D RoPE in an autoregressive model to improve music generation over residual multi-codebook tokenizers.

AffectCodec: Emotion-Preserving Neural Speech Codec for Expressive Speech Modeling

cs.SD · 2026-05-11 · unverdicted · novelty 7.0

AffectCodec is an emotion-guided neural speech codec that preserves emotional cues during quantization while maintaining semantic fidelity and prosodic naturalness.

VITA-QinYu: Expressive Spoken Language Model for Role-Playing and Singing

cs.CL · 2026-05-07 · unverdicted · novelty 7.0

VITA-QinYu is the first expressive end-to-end spoken language model supporting role-playing and singing alongside conversation, trained on 15.8K hours of data and outperforming prior models on expressiveness and conversational benchmarks.

PairAlign: A Framework for Sequence Tokenization via Self-Alignment with Applications to Audio Tokenization

cs.LG · 2026-05-07 · unverdicted · novelty 7.0 · 2 refs

PairAlign learns compact variable-length token sequences for audio via self-alignment on paired content-preserving views, achieving 55% fewer archive tokens than VQ while preserving edit-distance retrieval at 12.71 tokens/s.

Indic-CodecFake meets SATYAM: Towards Detecting Neural Audio Codec Synthesized Speech Deepfakes in Indic Languages

eess.AS · 2026-04-21 · unverdicted · novelty 7.0

Introduces the Indic-CodecFake dataset for Indic codec deepfakes and SATYAM, a novel hyperbolic ALM that outperforms baselines through dual-stage semantic-prosodic fusion using Bhattacharya distance.

Steering Autoregressive Music Generation with Recursive Feature Machines

cs.LG · 2025-10-21 · unverdicted · novelty 7.0

MusicRFM discovers interpretable concept directions in music model hidden states using RFM probes and injects them at inference to steer generation toward desired musical properties without retraining.

Language Model Beats Diffusion -- Tokenizer is Key to Visual Generation

cs.CV · 2023-10-09 · unverdicted · novelty 7.0

A new shared video-image tokenizer enables large language models to surpass diffusion models on standard visual generation benchmarks.

Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers

cs.CL · 2023-01-05 · unverdicted · novelty 7.0

VALL-E is a neural codec language model trained on 60K hours of speech that performs zero-shot TTS, synthesizing natural speech that matches an unseen speaker's voice, emotion, and environment from a 3-second prompt.

Learning to Evade: Adaptive Attacks on Audio Watermarking

cs.SD · 2026-06-21 · unverdicted · novelty 6.0

Introduces AWM adaptive attack using two-stage optimization and distribution estimation to bypass audio watermark detectors with low detection rates on voice datasets.

Bridging the Age Gap: Towards Detecting Neural Audio Codec Synthesized Elderly Speech Deepfake

eess.AS · 2026-06-19 · unverdicted · novelty 6.0

Defines ECFD task, releases ECF dataset, demonstrates poor generalization of prior detectors to elderly speech, and introduces BONSAI fusion of LanguageBind and ImageBind achieving 1.66% average EER.

citing papers explorer

Showing 28 of 78 citing papers after filters.

AudioPaLM: A Large Language Model That Can Speak and Listen

cs.CL · 2023-06-22 · unverdicted · none · ref 18 · 2 links · internal anchor

AudioPaLM unifies PaLM-2 and AudioLM to outperform prior systems on speech translation while enabling zero-shot speech-to-text for many unseen language pairs and voice transfer from short prompts.

FlowTTS-GRPO: Online Reinforcement Learning with Multi-Objective Reward Optimization for Flow-Matching Based Text-to-Speech

eess.AS · 2026-06-22 · unverdicted · none · ref 17 · internal anchor

FlowTTS-GRPO applies online RL with weighted multi-objective rewards to flow-matching TTS models via ODE-to-SDE conversion, reporting gains in speaker similarity and perceptual quality on CosyVoice 3.0 and F5-TTS.

AugCodec: A Low-Bitrate Disentangled Neural Speech Codec via Data Augmentation

cs.SD · 2026-06-20 · unverdicted · none · ref 7 · internal anchor

AugCodec disentangles speech into semantic, speaker, and prosody tokens via tailored data augmentations, achieving 12.5 Hz operation with three streams and outperforming prior codecs on LibriSpeech reconstruction and disentanglement metrics.

Which Speech Representation Better Matches Text-Native Reasoning? A Study of Speech-Text Alignment on Frame Rate and Representation

eess.AS · 2026-06-10 · unverdicted · none · ref 21 · internal anchor

Empirical sweep finds 4.17 Hz frame rate plus intermediate-layer alignment optimal for speech QA under frozen text LLM backbone.

Feature-Aligned Speech Watermarking for Robustness to Reconstruction Distortions

cs.SD · 2026-06-10 · unverdicted · none · ref 13 · internal anchor

Feature-aligned watermarking embeds a codec-generated pseudo-speech signal into the spectrogram to raise robustness against reconstruction models while keeping imperceptibility comparable to prior methods.

ContextCodec: Content-Focused Context Guidance for Ultra-Low Bitrate Speech Coding

cs.SD · 2026-06-09 · unverdicted · none · ref 22 · internal anchor

ContextCodec uses a dual-branch encoder with CLIP-style contrastive training on phoneme-aligned context features plus autoregressive refinement to improve quality-intelligibility at bitrates down to 500 bps.

F3-Tokenizer: Taming Audio Autoencoder Latents for Understanding and Generation

cs.SD · 2026-06-04 · unverdicted · none · ref 3 · internal anchor

F3-Tokenizer adapts audio autoencoder latents with noise-regularized bottleneck (channel normalization and stochastic perturbation) and a representation encoder (RQ-MTP plus frozen-LLM supervision) to support both high-dimensional understanding representations and normalized continuous generation ta

Wavelet as Tokenizer: Preliminary Results on a Shared Wavelet Token Schema for Natural Signals

eess.AS · 2026-05-30 · unverdicted · none · ref 11 · internal anchor

A continuous-token model with shared Haar wavelet coefficients reports 39.92 dB audio, 29.37 dB image, and 23.93 dB video PSNR on three datasets and shows energy-based selection outperforms uniform selection by roughly 16 dB.

Multimodal Music Recommendation System using LLMs

cs.IR · 2026-05-28 · unverdicted · none · ref 3 · internal anchor

Extending E4SRec with multimodal content features on LastFM-1K yields up to 95% Recall and 79% NDCG gains over ID-only baselines, though naive fusion does not always improve results.

Thinking-while-speaking: A Controlled, Interleaved Reasoning Method for Real-Time Speech Generation

cs.CL · 2026-05-20 · unverdicted · none · ref 26 · internal anchor

InterRS enables real-time speech generation with interleaved reasoning via a controlled data pipeline, interleaved SFT, and RL using TA-Balance and Linguistic Quality rewards, yielding 13% gains on math and logic benchmarks.

A Survey of Automated Presentation Coaching: Systems, Methods, and Open Challenges

cs.CL · 2026-05-11 · unverdicted · none · ref 106 · internal anchor

A survey introduces a five-dimensional taxonomy for automated presentation coaching systems, maps existing work onto it, and identifies open challenges including data scarcity and accent fairness.

Drum Synthesis from Expressive Drum Grids via Neural Audio Codecs

cs.SD · 2026-05-11 · unverdicted · none · ref 4 · internal anchor

A Transformer predicts tokens from neural audio codecs (EnCodec, DAC, X-Codec) to convert expressive drum grids into audio, trained and evaluated on the E-GMD dataset using objective metrics.

Diffusion Reconstruction towards Generalizable Audio Deepfake Detection

cs.SD · 2026-04-29 · unverdicted · none · ref 21 · internal anchor

Diffusion reconstruction creates hard samples for audio deepfake detection training, and when paired with feature aggregation and RACL, it reduces average EER versus baselines.

Adopting State-of-the-Art Pretrained Audio Representations for Music Recommender Systems

cs.IR · 2026-04-25 · unverdicted · none · ref 19 · internal anchor

Pretrained audio models show large performance gaps between standard MIR tasks and music recommendation in both hot and cold-start settings.

Voxtral TTS

cs.AI · 2026-03-26 · unverdicted · none · ref 4 · internal anchor

Voxtral TTS produces expressive multilingual speech from 3-second reference audio with a hybrid autoregressive-plus-flow-matching architecture and a new VQ-FSQ tokenizer, achieving 68.4% win rate over ElevenLabs in human evaluations.

Expectation and Acoustic Neural Network Representations Enhance Music Identification from Brain Activity

cs.AI · 2026-03-03 · unverdicted · none · ref 46 · internal anchor

Separating acoustic and expectation ANN representations as teacher targets improves EEG music identification beyond baselines and seed ensembles.

A Simple Method to Enhance Pre-trained Language Models with Speech Tokens for Classification

cs.CL · 2025-12-08 · unverdicted · none · ref 11 · internal anchor

Lasso-selected speech tokens enhance text LLMs for multimodal classification by reducing long audio sequences to task-relevant features via self-supervised adaptation.

XAttnMark: Learning Robust Audio Watermarking with Cross-Attention

cs.SD · 2025-02-06 · unverdicted · none · ref 19 · internal anchor

XAttnMark is a new neural audio watermarking method using partial parameter sharing, cross-attention for message retrieval, temporal conditioning, and a psychoacoustic TF masking loss that reports state-of-the-art detection and attribution robustness.

Movie Gen: A Cast of Media Foundation Models

cs.CV · 2024-10-17 · unverdicted · none · ref 15 · internal anchor

A 30B-parameter transformer and related models generate high-quality videos and audio, claiming state-of-the-art results on text-to-video, video editing, personalization, and audio generation tasks.

F5-TTS: A Fairytaler that Fakes Fluent and Faithful Speech with Flow Matching

eess.AS · 2024-10-09 · unverdicted · none · ref 88 · internal anchor

F5-TTS generates natural speech from text via flow matching on DiT with simple text padding, ConvNeXt refinement, and sway sampling, trained on 100K hours multilingual data.

wav2tok 2.0: Scalable Audio Tokenization Maintaining Explicit Pairwise Token Alignment for Efficient Audio Retrieval

cs.SD · 2026-06-25 · unverdicted · none · ref 30 · internal anchor

wav2tok 2.0 improves audio tokenization for query-by-example spoken term detection via staged training that first learns speaker-invariant representations then enforces pairwise token alignment.

DinoLink: A Token-Centric Representation Compression Framework for Bandwidth-Constrained Collaborative V2X Perception

cs.CV · 2026-06-24 · unverdicted · none · ref 15 · 2 links · internal anchor

DinoLink uses saliency-aware token pruning plus residual vector quantization to cut V2X bitrate by 139x while reporting 32.8% mAP on nuScenes.

STAR-VAE: Structured Topology-Aware Regularization for Audio Reconstruction and Generation

eess.AS · 2026-06-22 · unverdicted · none · ref 3 · internal anchor

STAR-VAE introduces topology-aware regularization to reshape VAE latent geometry for audio, claiming to resolve the Rate-Distortion-Regularity Trilemma and achieve SOTA reconstruction.

SARA: A Dual-Stream VAE for High-Fidelity Speech Generation via Integrating Semantic and Acoustic Representations

cs.SD · 2026-06-10 · unverdicted · none · ref 13 · internal anchor

SARA is a dual-stream VAE that integrates semantic and acoustic streams to achieve high-fidelity reconstruction and natural zero-shot TTS without complex regularizers.

MOSS-Audio Technical Report

cs.SD · 2026-06-01 · unverdicted · none · ref 59 · internal anchor

MOSS-Audio is an audio-language model using a 12.5 Hz encoder, DeepStack cross-layer injection, time markers, and an event-preserving annotation pipeline for unified audio understanding.

Afrispeech Semantics: Evaluating Audio Semantic Reasoning in Spoken Language Models Across Domains and Accents

cs.CL · 2026-05-11 · unverdicted · none · ref 60 · internal anchor

Audio language models are benchmarked on five semantic and paralinguistic reasoning tasks to reveal limitations in handling spoken audio evidence, accent variation, and domain shifts.

Toward Native Multimodal Modeling: A Roadmap

cs.CV · 2026-05-25 · unverdicted · none · ref 182 · internal anchor

A roadmap that defines architectural nativity for multimodal models and categorizes them into Multi-to-Text, Multi-to-Target, and Multi-to-Multi types while outlining an industrial pipeline toward unified transformer-based native multimodal modeling.

Expressive Prompting: Improving Emotion Intensity and Speaker Consistency in Zero-Shot TTS

cs.SD · 2024-09-27 · unverdicted · none · ref 6 · internal anchor

A two-stage static-then-dynamic prompt selection strategy using prosodic features, LLM coherence scores, and similarity metrics improves emotion intensity and speaker consistency in zero-shot TTS.
