WavTTS is the first raw-waveform diffusion TTS model using DiT flow matching and multi-scale mel supervision that approaches SOTA latent zero-shot performance while beating prior end-to-end models.
hub Canonical reference
High Fidelity Neural Audio Compression
Canonical reference. 70% of citing Pith papers cite this work as background.
abstract
We introduce a state-of-the-art real-time, high-fidelity, audio codec leveraging neural networks. It consists in a streaming encoder-decoder architecture with quantized latent space trained in an end-to-end fashion. We simplify and speed-up the training by using a single multiscale spectrogram adversary that efficiently reduces artifacts and produce high-quality samples. We introduce a novel loss balancer mechanism to stabilize training: the weight of a loss now defines the fraction of the overall gradient it should represent, thus decoupling the choice of this hyper-parameter from the typical scale of the loss. Finally, we study how lightweight Transformer models can be used to further compress the obtained representation by up to 40%, while staying faster than real time. We provide a detailed description of the key design choices of the proposed model including: training objective, architectural changes and a study of various perceptual loss functions. We present an extensive subjective evaluation (MUSHRA tests) together with an ablation study for a range of bandwidths and audio domains, including speech, noisy-reverberant speech, and music. Our approach is superior to the baselines methods across all evaluated settings, considering both 24 kHz monophonic and 48 kHz stereophonic audio. Code and models are available at github.com/facebookresearch/encodec.
hub tools
citation-role summary
citation-polarity summary
representative citing papers
MusicLM produces coherent multi-minute 24 kHz music from text prompts using hierarchical sequence-to-sequence modeling and outperforms prior systems in quality and text adherence.
HERMES provides a reusable hierarchical labeling substrate for pre-training data that reveals granularity-specific effects in data mixing rules during model training.
FlexiSLM is the first spoken language model supporting dynamic and controllable frame rates on speech input and output, outperforming fixed-rate 7B models at high quality and enabling faster inference at lower rates like 6.25 Hz.
DTM-Codec achieves better reconstruction quality and intelligibility than fixed-frame-rate neural speech codecs at matched total bitrate via dynamic token masking and Path Length Equalization for variable frame rates.
AudioCALM presents a continuous autoregressive framework with flow-matching prediction and A-MoME architecture that unifies speech, sound, and music generation while matching modality-specific state-of-the-art performance.
NAC adapts multi-scale RVQGAN audio codecs with kinematic-specific losses to produce ordered action tokens that yield lower reconstruction error and higher task success than prior tokenizers in VLA models.
A survey proposing an L0-L3 architectural hierarchy, T×I×R interaction ontology, and IDLE/LISTEN/SPEAK/WAIT/DUAL decision state machine for full-duplex spoken dialogue systems, documenting a realization gap between architectural potential and observed behavior due to training data limits.
HoliDubber introduces a patch-based autoregressive diffusion transformer for joint text-guided synthesis of speech and ambient audio in video dubbing, with a new benchmark showing outperformance over prior speech-only methods.
Introduces SARL benchmark showing pretrained audio encoders encode source-level spatial factors more readily than room-level factors, with patterns shaped by input configuration and training paradigm.
Neural reconstruction losses in VAEs reduce latent information content and produce more isotropic latent geometries with even uncertainty distribution.
Prompt Codebooks recasts automatic prompt optimization as discrete learning over a finite vocabulary of atomic natural-language instincts with per-instance routing, yielding up to +30.36 point gains over zero-shot and shorter prompts on six benchmarks.
Presents the ATTM grand challenge with efficiency and performance tracks for text-to-music generation using a public instrumental music dataset, evaluated via FAD, CLAP, a new CCS metric, and subjective tests.
CodecAttack perturbs audio in codec latent space with multi-bitrate EoT to achieve 85.5% average ASR on Opus-compressed Audio LLMs versus under 26% for waveform baselines, with transfer to MP3 and AAC.
BandTok tokenizes Mel-spectrograms as independent time-frequency band tokens from a single codebook and pairs it with 2D RoPE in an autoregressive model to improve music generation over residual multi-codebook tokenizers.
AffectCodec is an emotion-guided neural speech codec that preserves emotional cues during quantization while maintaining semantic fidelity and prosodic naturalness.
VITA-QinYu is the first expressive end-to-end spoken language model supporting role-playing and singing alongside conversation, trained on 15.8K hours of data and outperforming prior models on expressiveness and conversational benchmarks.
PairAlign learns compact variable-length token sequences for audio via self-alignment on paired content-preserving views, achieving 55% fewer archive tokens than VQ while preserving edit-distance retrieval at 12.71 tokens/s.
Introduces the Indic-CodecFake dataset for Indic codec deepfakes and SATYAM, a novel hyperbolic ALM that outperforms baselines through dual-stage semantic-prosodic fusion using Bhattacharya distance.
MusicRFM discovers interpretable concept directions in music model hidden states using RFM probes and injects them at inference to steer generation toward desired musical properties without retraining.
A new shared video-image tokenizer enables large language models to surpass diffusion models on standard visual generation benchmarks.
VALL-E is a neural codec language model trained on 60K hours of speech that performs zero-shot TTS, synthesizing natural speech that matches an unseen speaker's voice, emotion, and environment from a 3-second prompt.
Introduces AWM adaptive attack using two-stage optimization and distribution estimation to bypass audio watermark detectors with low detection rates on voice datasets.
Defines ECFD task, releases ECF dataset, demonstrates poor generalization of prior detectors to elderly speech, and introduces BONSAI fusion of LanguageBind and ImageBind achieving 1.66% average EER.
citing papers explorer
-
AudioPaLM: A Large Language Model That Can Speak and Listen
AudioPaLM unifies PaLM-2 and AudioLM to outperform prior systems on speech translation while enabling zero-shot speech-to-text for many unseen language pairs and voice transfer from short prompts.
-
FlowTTS-GRPO: Online Reinforcement Learning with Multi-Objective Reward Optimization for Flow-Matching Based Text-to-Speech
FlowTTS-GRPO applies online RL with weighted multi-objective rewards to flow-matching TTS models via ODE-to-SDE conversion, reporting gains in speaker similarity and perceptual quality on CosyVoice 3.0 and F5-TTS.
-
AugCodec: A Low-Bitrate Disentangled Neural Speech Codec via Data Augmentation
AugCodec disentangles speech into semantic, speaker, and prosody tokens via tailored data augmentations, achieving 12.5 Hz operation with three streams and outperforming prior codecs on LibriSpeech reconstruction and disentanglement metrics.
-
Which Speech Representation Better Matches Text-Native Reasoning? A Study of Speech-Text Alignment on Frame Rate and Representation
Empirical sweep finds 4.17 Hz frame rate plus intermediate-layer alignment optimal for speech QA under frozen text LLM backbone.
-
Feature-Aligned Speech Watermarking for Robustness to Reconstruction Distortions
Feature-aligned watermarking embeds a codec-generated pseudo-speech signal into the spectrogram to raise robustness against reconstruction models while keeping imperceptibility comparable to prior methods.
-
ContextCodec: Content-Focused Context Guidance for Ultra-Low Bitrate Speech Coding
ContextCodec uses a dual-branch encoder with CLIP-style contrastive training on phoneme-aligned context features plus autoregressive refinement to improve quality-intelligibility at bitrates down to 500 bps.
-
F3-Tokenizer: Taming Audio Autoencoder Latents for Understanding and Generation
F3-Tokenizer adapts audio autoencoder latents with noise-regularized bottleneck (channel normalization and stochastic perturbation) and a representation encoder (RQ-MTP plus frozen-LLM supervision) to support both high-dimensional understanding representations and normalized continuous generation ta
-
Wavelet as Tokenizer: Preliminary Results on a Shared Wavelet Token Schema for Natural Signals
A continuous-token model with shared Haar wavelet coefficients reports 39.92 dB audio, 29.37 dB image, and 23.93 dB video PSNR on three datasets and shows energy-based selection outperforms uniform selection by roughly 16 dB.
-
Multimodal Music Recommendation System using LLMs
Extending E4SRec with multimodal content features on LastFM-1K yields up to 95% Recall and 79% NDCG gains over ID-only baselines, though naive fusion does not always improve results.
-
Thinking-while-speaking: A Controlled, Interleaved Reasoning Method for Real-Time Speech Generation
InterRS enables real-time speech generation with interleaved reasoning via a controlled data pipeline, interleaved SFT, and RL using TA-Balance and Linguistic Quality rewards, yielding 13% gains on math and logic benchmarks.
-
A Survey of Automated Presentation Coaching: Systems, Methods, and Open Challenges
A survey introduces a five-dimensional taxonomy for automated presentation coaching systems, maps existing work onto it, and identifies open challenges including data scarcity and accent fairness.
-
Drum Synthesis from Expressive Drum Grids via Neural Audio Codecs
A Transformer predicts tokens from neural audio codecs (EnCodec, DAC, X-Codec) to convert expressive drum grids into audio, trained and evaluated on the E-GMD dataset using objective metrics.
-
Diffusion Reconstruction towards Generalizable Audio Deepfake Detection
Diffusion reconstruction creates hard samples for audio deepfake detection training, and when paired with feature aggregation and RACL, it reduces average EER versus baselines.
-
Adopting State-of-the-Art Pretrained Audio Representations for Music Recommender Systems
Pretrained audio models show large performance gaps between standard MIR tasks and music recommendation in both hot and cold-start settings.
-
Voxtral TTS
Voxtral TTS produces expressive multilingual speech from 3-second reference audio with a hybrid autoregressive-plus-flow-matching architecture and a new VQ-FSQ tokenizer, achieving 68.4% win rate over ElevenLabs in human evaluations.
-
Expectation and Acoustic Neural Network Representations Enhance Music Identification from Brain Activity
Separating acoustic and expectation ANN representations as teacher targets improves EEG music identification beyond baselines and seed ensembles.
-
A Simple Method to Enhance Pre-trained Language Models with Speech Tokens for Classification
Lasso-selected speech tokens enhance text LLMs for multimodal classification by reducing long audio sequences to task-relevant features via self-supervised adaptation.
-
XAttnMark: Learning Robust Audio Watermarking with Cross-Attention
XAttnMark is a new neural audio watermarking method using partial parameter sharing, cross-attention for message retrieval, temporal conditioning, and a psychoacoustic TF masking loss that reports state-of-the-art detection and attribution robustness.
-
Movie Gen: A Cast of Media Foundation Models
A 30B-parameter transformer and related models generate high-quality videos and audio, claiming state-of-the-art results on text-to-video, video editing, personalization, and audio generation tasks.
-
F5-TTS: A Fairytaler that Fakes Fluent and Faithful Speech with Flow Matching
F5-TTS generates natural speech from text via flow matching on DiT with simple text padding, ConvNeXt refinement, and sway sampling, trained on 100K hours multilingual data.
-
wav2tok 2.0: Scalable Audio Tokenization Maintaining Explicit Pairwise Token Alignment for Efficient Audio Retrieval
wav2tok 2.0 improves audio tokenization for query-by-example spoken term detection via staged training that first learns speaker-invariant representations then enforces pairwise token alignment.
-
DinoLink: A Token-Centric Representation Compression Framework for Bandwidth-Constrained Collaborative V2X Perception
DinoLink uses saliency-aware token pruning plus residual vector quantization to cut V2X bitrate by 139x while reporting 32.8% mAP on nuScenes.
-
STAR-VAE: Structured Topology-Aware Regularization for Audio Reconstruction and Generation
STAR-VAE introduces topology-aware regularization to reshape VAE latent geometry for audio, claiming to resolve the Rate-Distortion-Regularity Trilemma and achieve SOTA reconstruction.
-
SARA: A Dual-Stream VAE for High-Fidelity Speech Generation via Integrating Semantic and Acoustic Representations
SARA is a dual-stream VAE that integrates semantic and acoustic streams to achieve high-fidelity reconstruction and natural zero-shot TTS without complex regularizers.
-
MOSS-Audio Technical Report
MOSS-Audio is an audio-language model using a 12.5 Hz encoder, DeepStack cross-layer injection, time markers, and an event-preserving annotation pipeline for unified audio understanding.
-
Afrispeech Semantics: Evaluating Audio Semantic Reasoning in Spoken Language Models Across Domains and Accents
Audio language models are benchmarked on five semantic and paralinguistic reasoning tasks to reveal limitations in handling spoken audio evidence, accent variation, and domain shifts.
-
Toward Native Multimodal Modeling: A Roadmap
A roadmap that defines architectural nativity for multimodal models and categorizes them into Multi-to-Text, Multi-to-Target, and Multi-to-Multi types while outlining an industrial pipeline toward unified transformer-based native multimodal modeling.
-
Expressive Prompting: Improving Emotion Intensity and Speaker Consistency in Zero-Shot TTS
A two-stage static-then-dynamic prompt selection strategy using prosodic features, LLM coherence scores, and similarity metrics improves emotion intensity and speaker consistency in zero-shot TTS.