Spectral codecs: Improving non-autoregressive speech synthesis with spectrogram-based audio codecs

2024 · arXiv 2406.05298

3 Pith papers cite this work. Polarity classification is still indexing.

representative citing papers

Modeling Music as a Time-Frequency Image: A 2D Tokenizer for Music Generation

cs.SD · 2026-05-15 · unverdicted · novelty 7.0

BandTok tokenizes Mel-spectrograms as independent time-frequency band tokens from a single codebook and pairs it with 2D RoPE in an autoregressive model to improve music generation over residual multi-codebook tokenizers.

Benchmarking Neural Speech Compression from a Rate-Distortion Perspective

eess.AS · 2026-06-10 · unverdicted · novelty 6.0

ECC integrates hyperprior side information, channel-wise context, latent residual prediction, temporal modeling, and entropy skip into a learned entropy model, yielding 39.9% and 76.3% average BD-rate reductions on ViSQOL and PESQ over baselines.

Ultra-Low-Bitrate Mel-Spectrogram-based Neural Speech Coding with Flow-Matching-based Refinement and Vocoding-driven Reconstruction

eess.AS · 2026-05-25 · unverdicted · novelty 6.0

FMelCodec is a three-stage mel-spectrogram codec using 640x VQ compression, conditional flow matching refinement, and HiFi-GAN reconstruction that reports higher quality than prior methods at 250 bps for 16 kHz speech.

citing papers explorer

Showing 3 of 3 citing papers after filters.

Modeling Music as a Time-Frequency Image: A 2D Tokenizer for Music Generation · cs.SD · 2026-05-15 · unverdicted · none · ref 13
Benchmarking Neural Speech Compression from a Rate-Distortion Perspective · eess.AS · 2026-06-10 · unverdicted · none · ref 26
Ultra-Low-Bitrate Mel-Spectrogram-based Neural Speech Coding with Flow-Matching-based Refinement and Vocoding-driven Reconstruction · eess.AS · 2026-05-25 · unverdicted · none · ref 17
