hub Mixed citations

High Fidelity Neural Audio Compression

Mixed citation behavior. Most common role is background (67%).

53 Pith papers citing it
Background 67% of classified citations
abstract

We introduce a state-of-the-art real-time, high-fidelity audio codec leveraging neural networks. It consists of a streaming encoder-decoder architecture with a quantized latent space trained in an end-to-end fashion. We simplify and speed up the training by using a single multiscale spectrogram adversary that efficiently reduces artifacts and produces high-quality samples. We introduce a novel loss balancer mechanism to stabilize training: the weight of a loss now defines the fraction of the overall gradient it should represent, thus decoupling the choice of this hyper-parameter from the typical scale of the loss. Finally, we study how lightweight Transformer models can be used to further compress the obtained representation by up to 40%, while staying faster than real time. We provide a detailed description of the key design choices of the proposed model, including the training objective, architectural changes, and a study of various perceptual loss functions. We present an extensive subjective evaluation (MUSHRA tests) together with an ablation study for a range of bandwidths and audio domains, including speech, noisy-reverberant speech, and music. Our approach is superior to the baseline methods across all evaluated settings, considering both 24 kHz monophonic and 48 kHz stereophonic audio. Code and models are available at github.com/facebookresearch/encodec.
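
The loss balancer idea in the abstract (a loss weight sets the fraction of the overall gradient, independent of the loss's natural scale) can be illustrated with a minimal sketch. This is not the paper's implementation: the function name `balance_gradients`, the `total_norm` parameter, and the toy gradient values are assumptions made for illustration, and the actual mechanism may differ in details such as smoothing of the gradient norms over time.

```python
import math


def l2_norm(vec):
    """Euclidean norm of a list of floats."""
    return math.sqrt(sum(x * x for x in vec))


def balance_gradients(grads, weights, total_norm=1.0, eps=1e-12):
    """Combine per-loss gradients so each weight sets a gradient fraction.

    grads:   dict mapping a loss name to its gradient (list of floats),
             taken with respect to the same tensor, e.g. the model output.
    weights: dict mapping a loss name to a non-negative weight.
    total_norm: hypothetical reference norm for the combined gradient.
    """
    weight_sum = sum(weights[name] for name in grads)
    size = len(next(iter(grads.values())))
    combined = [0.0] * size
    for name, grad in grads.items():
        # Fraction of the overall gradient assigned to this loss.
        fraction = weights[name] / weight_sum
        # Dividing by the gradient's own norm removes its natural scale.
        scale = total_norm * fraction / (l2_norm(grad) + eps)
        for i, value in enumerate(grad):
            combined[i] += scale * value
    return combined


# Two toy gradients whose raw scales differ by six orders of magnitude.
grads = {"reconstruction": [1000.0, 0.0], "adversarial": [0.0, 0.001]}
weights = {"reconstruction": 1.0, "adversarial": 3.0}
balanced = balance_gradients(grads, weights)
```

In this toy case the reconstruction gradient contributes about a quarter of the reference norm and the adversarial gradient about three quarters, matching the 1:3 weights even though the raw gradients differ enormously in magnitude.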

citation-role summary

background 6 · method 2 · other 1

representative citing papers

MusicLM: Generating Music From Text

cs.SD · 2023-01-26 · conditional · novelty 8.0

MusicLM produces coherent multi-minute 24 kHz music from text prompts using hierarchical sequence-to-sequence modeling and outperforms prior systems in quality and text adherence.

How Neural Losses Shape VAE Latents

cs.LG · 2026-05-30 · unverdicted · novelty 7.0

Neural reconstruction losses in VAEs reduce latent information content and produce more isotropic latent geometries with even uncertainty distribution.

Codec-Robust Attacks on Audio LLMs

cs.SD · 2026-05-19 · unverdicted · novelty 7.0 · 2 refs

CodecAttack perturbs audio in codec latent space with multi-bitrate EoT to achieve 85.5% average ASR on Opus-compressed Audio LLMs versus under 26% for waveform baselines, with transfer to MP3 and AAC.

Taming Audio VAEs via Target-KL Regularization

cs.SD · 2026-05-16 · unverdicted · novelty 6.0

The paper introduces target-KL regularization to train audio VAEs at specific bitrates, enabling rate-distortion curves and comparison to discrete audio codecs for improved text-to-sound generation.

LLM-Codec: Neural Audio Codec Meets Language Model Objectives

cs.SD · 2026-04-20 · unverdicted · novelty 6.0

LLM-Codec augments audio codec training with multi-step token prediction and contrastive semantic alignment to improve both waveform reconstruction and autoregressive predictability for speech language models.

HCFD: A Benchmark for Audio Deepfake Detection in Healthcare

eess.AS · 2026-04-19 · unverdicted · novelty 6.0

HCFD is a new pathology-aware benchmark and dataset for codec-fake audio detection in healthcare, with PHOENIX-Mamba achieving up to 97% accuracy by modeling fakes as modes in hyperbolic space.

Efficient Training for Cross-lingual Speech Language Models

cs.CL · 2026-04-13 · unverdicted · novelty 6.0

CSLM achieves cross-modal and cross-lingual alignment in speech LLMs via continual pre-training on discrete tokens and speech-text interleaved instruction tuning, enabling scalability without massive speech datasets.

citing papers explorer

Showing 8 of 8 citing papers after filters.

  • VITA-QinYu: Expressive Spoken Language Model for Role-Playing and Singing cs.CL · 2026-05-07 · unverdicted · none · ref 34 · internal anchor

    VITA-QinYu is the first expressive end-to-end spoken language model supporting role-playing and singing alongside conversation, trained on 15.8K hours of data and outperforming prior models on expressiveness and conversational benchmarks.

  • Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers cs.CL · 2023-01-05 · unverdicted · none · ref 6 · internal anchor

    VALL-E is a neural codec language model trained on 60K hours of speech that performs zero-shot TTS, synthesizing natural speech that matches an unseen speaker's voice, emotion, and environment from a 3-second prompt.

  • Efficient Training for Cross-lingual Speech Language Models cs.CL · 2026-04-13 · unverdicted · none · ref 1 · internal anchor

    CSLM achieves cross-modal and cross-lingual alignment in speech LLMs via continual pre-training on discrete tokens and speech-text interleaved instruction tuning, enabling scalability without massive speech datasets.

  • Step-Audio 2 Technical Report cs.CL · 2025-07-22 · unverdicted · none · ref 15 · internal anchor

    Step-Audio 2 integrates a latent audio encoder, reasoning-centric reinforcement learning, and discrete audio token generation into language modeling to deliver state-of-the-art performance on audio understanding and conversational benchmarks.

  • Better & Faster Large Language Models via Multi-token Prediction cs.CL · 2024-04-30 · conditional · none · ref 4 · internal anchor

    Multi-token prediction training yields higher sample efficiency, better benchmark scores on code generation, and up to 3x faster inference than standard next-token prediction for LLMs.

  • AudioPaLM: A Large Language Model That Can Speak and Listen cs.CL · 2023-06-22 · unverdicted · none · ref 18 · 2 links · internal anchor

    AudioPaLM unifies PaLM-2 and AudioLM to outperform prior systems on speech translation while enabling zero-shot speech-to-text for many unseen language pairs and voice transfer from short prompts.

  • Thinking-while-speaking: A Controlled, Interleaved Reasoning Method for Real-Time Speech Generation cs.CL · 2026-05-20 · unverdicted · none · ref 26 · internal anchor

    InterRS enables real-time speech generation with interleaved reasoning via a controlled data pipeline, interleaved SFT, and RL using TA-Balance and Linguistic Quality rewards, yielding 13% gains on math and logic benchmarks.

  • A Simple Method to Enhance Pre-trained Language Models with Speech Tokens for Classification cs.CL · 2025-12-08 · unverdicted · none · ref 11 · internal anchor

    Lasso-selected speech tokens enhance text LLMs for multimodal classification by reducing long audio sequences to task-relevant features via self-supervised adaptation.