MELD: Mel-Spectrogram-Based Speech Language Modeling with Discrete Latent Variables

Alexandre Mourachko; Duc Le; Gil Keren; Hao Tang; Jay Mahadeokar; Ozlem Kalinli; Sung-Lin Yeh; Wei Zhou; Zhong Meng

arxiv: 2605.29859 · v1 · pith:TDAC27ZWnew · submitted 2026-05-28 · 📡 eess.AS · cs.CL

Pith reviewed 2026-06-29 05:33 UTC · model grok-4.3

classification 📡 eess.AS cs.CL

keywords: mel-spectrogram · speech language model · discrete latent variables · joint optimization · zero-shot TTS · zero-shot STT · autoregressive modeling

The pith

MELD jointly optimizes the encoder and the autoregressive model on mel spectrograms via discrete latents, improving zero-shot TTS and STT while reducing prolonged silences and word omissions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Recent speech language models train encoders separately from their autoregressive components, so the extracted features may not align well with tasks like text-to-speech or speech-to-text. MELD places a discrete latent variable model directly on mel spectrograms and trains the encoder together with the autoregressive predictor in a single objective. This joint training produces measurable gains over both codec-based and other mel-spectrogram baselines on zero-shot TTS and STT. The same change also cuts two frequent autoregressive artifacts: stretches of unwanted silence and dropped words.

Core claim

MELD shows that a discrete latent variable model on mel spectrograms permits joint optimization of the encoder and the downstream autoregressive speech language model; the resulting representations outperform those from separately trained encoders on zero-shot TTS and STT and simultaneously suppress prolonged silence generation and word omissions.

What carries the argument

Discrete latent variable model on mel spectrograms that couples encoder training to the autoregressive language-model objective.
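
One way to picture this coupling is the textbook evidence lower bound for a sequential discrete-latent model. The block below is a generic sketch, not the paper's objective: the notation (mel frames x, discrete latents z, encoder q with parameters phi, autoregressive prior and decoder p with parameters theta) and the per-frame factorization of q are assumptions made here for illustration.

```latex
% Generic ELBO for a sequential discrete-latent model (illustrative only).
% Assumed: q_\phi(z_{1:T} \mid x_{1:T}) = \prod_t q_\phi(z_t \mid x_{1:T}),
%          p_\theta(x_{1:T}, z_{1:T}) = \prod_t p_\theta(z_t \mid z_{<t})\, p_\theta(x_t \mid z_{\le t}).
\log p_\theta(x_{1:T}) \;\ge\;
  \sum_{t=1}^{T} \mathbb{E}_{q_\phi(z_{1:T} \mid x_{1:T})}
    \big[ \log p_\theta(x_t \mid z_{\le t}) \big]
  \;-\;
  \sum_{t=1}^{T} \mathbb{E}_{q_\phi(z_{<t} \mid x_{1:T})}
    \Big[ \mathrm{KL}\big( q_\phi(z_t \mid x_{1:T}) \,\big\|\, p_\theta(z_t \mid z_{<t}) \big) \Big]
```

In a bound of this shape both terms depend on phi and theta, so a single gradient step moves the encoder and the autoregressive model together. A two-stage pipeline would instead fix phi before theta is trained.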

If this is right

Zero-shot TTS and STT performance rises relative to both codec and separate-encoder mel baselines.
Autoregressive mel-spectrogram generation produces fewer extended silences.
Word-omission errors decline in the same generation setting.
The joint objective removes the need for a separate encoder pre-training stage before autoregressive modeling.
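
To make "extended silences" concrete, the sketch below shows one simple way such an artifact could be counted on generated output. It is not the paper's evaluation protocol: the function name, the per-frame energy input, and the -60 dB threshold are hypothetical choices made here.

```python
def longest_silent_run(frame_energies_db, threshold_db=-60.0):
    """Return the longest run of consecutive frames below threshold_db.

    frame_energies_db: per-frame energy of a generated mel spectrogram, in dB.
    threshold_db: hypothetical silence threshold, not taken from the paper.
    """
    longest = 0
    current = 0
    for energy in frame_energies_db:
        if energy < threshold_db:
            # Still inside a silent stretch: extend it and track the maximum.
            current += 1
            longest = max(longest, current)
        else:
            # A non-silent frame ends the current stretch.
            current = 0
    return longest


# Example: one silent stretch of three frames, then a single silent frame.
energies = [-20.0, -70.0, -80.0, -75.0, -10.0, -90.0]
print(longest_silent_run(energies))
```

Multiplying the returned frame count by the spectrogram hop duration gives the silence length in seconds. Comparing that value across systems is one way the claim could be checked.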

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The result implies that the usual two-stage pipeline (encoder pre-training then language-model training) is a performance bottleneck rather than an inherent requirement.
Joint optimization may generalize to other spectrogram-based or continuous audio representations where separate encoders are currently standard.

Load-bearing premise

Separately trained encoders produce representations that are meaningfully suboptimal for autoregressive TTS and STT, and joint optimization can close the gap without creating new convergence problems.

What would settle it

A controlled experiment in which a separately trained encoder matches or exceeds MELD on zero-shot TTS and STT word-error and similarity metrics while showing equal or lower rates of silence and omission errors.
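
The word-error and omission measurements this test relies on can be sketched with a standard edit-distance alignment. The code below is a minimal illustration under stated assumptions, not the paper's scoring script: the function name and whitespace tokenization are hypothetical, and word omissions are taken to be the deletions in the alignment.

```python
def align_counts(reference, hypothesis):
    """Count substitutions, deletions, and insertions between two transcripts."""
    ref = reference.split()
    hyp = hypothesis.split()
    n, m = len(ref), len(hyp)

    # dist[i][j] is the word-level edit distance between ref[:i] and hyp[:j].
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dist[i][0] = i
    for j in range(1, m + 1):
        dist[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j - 1] + cost,  # match or substitution
                dist[i - 1][j] + 1,         # deletion (word omitted)
                dist[i][j - 1] + 1,         # insertion
            )

    # Walk back through the table to attribute each edit to a category.
    subs = dels = ins = 0
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            if dist[i][j] == dist[i - 1][j - 1] + cost:
                subs += cost
                i, j = i - 1, j - 1
                continue
        if i > 0 and dist[i][j] == dist[i - 1][j] + 1:
            dels += 1
            i -= 1
        else:
            ins += 1
            j -= 1

    wer = (subs + dels + ins) / max(n, 1)
    return {"sub": subs, "del": dels, "ins": ins, "wer": wer}


result = align_counts("the cat sat on the mat", "the cat on the mat")
print(result["del"], result["wer"])
```

Reporting deletions separately from the overall word error rate matters here, because a system can lower its error rate while still dropping words.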

Original abstract

Recent speech language models rely on encoders that are optimized separately from autoregressive models. Since these encoders are unaware of the downstream objectives, the extracted representations may not be optimal for downstream tasks. To address this limitation, we introduce a discrete latent variable model on mel spectrograms that jointly optimizes the encoder and the speech language model. Joint optimization not only brings improvements over codec-based and other mel-spectrogram-based baselines on zero-shot Text-to-Speech (TTS) and Speech-to-Text (STT) tasks, but also effectively alleviates common issues in autoregressive mel-spectrogram modeling, such as prolonged silence generation and word omissions.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

MELD claims that jointly training the encoder and the autoregressive LM on discrete mel latents beats separate pipelines and cuts silence and omission artifacts, but the abstract supplies no metrics, ablations, or protocol, so the gains remain unverified.

The core idea is joint optimization of the encoder and the speech LM on discrete mel-spectrogram latents instead of training them separately. The paper positions this as a direct response to encoders that ignore the downstream autoregressive objective, and it asserts gains on zero-shot TTS and STT plus fewer prolonged silences and word omissions.

What is actually new is the explicit joint-training recipe for mel-based discrete models. Prior work often kept the encoder fixed or trained it with a separate objective; tying it to the LM loss is a straightforward extension that could matter if the representations really shift in useful ways.

The paper does a clean job stating the limitation and the proposed fix. The motivation is easy to follow and matches known issues in autoregressive mel modeling.

The soft spot is obvious and central: the abstract contains no numbers, no baselines, no ablation tables, and no description of the discretization, loss weighting, or training schedule. Every performance claim therefore sits on an unevaluated assertion. If the full paper shows solid comparisons and controls for the joint objective, that changes the picture; without them the soundness is low. Minor issues like missing details on how the discrete latents are sampled or stabilized would be easy to fix in revision.

This is for people already working on speech language models who care about mel-spectrogram routes versus neural codecs. A reader in that subfield could extract the motivation and the joint-training sketch, but the practical value hinges on the missing experiments.

I would send it to peer review. The topic is live, the proposed change is simple enough to test, and a referee can check whether the results actually support the claims once the full manuscript is in hand.

Referee Report

2 major / 0 minor

Summary. The manuscript introduces MELD, a discrete latent variable model operating on mel spectrograms. It jointly optimizes an encoder with an autoregressive speech language model, claiming this yields improvements over codec-based and other mel-spectrogram baselines on zero-shot TTS and STT while reducing artifacts such as prolonged silence and word omissions.

Significance. If the joint-optimization result were demonstrated with rigorous evidence, the work could meaningfully advance speech language modeling by showing that task-aware encoder training mitigates suboptimal representations. The current manuscript supplies no such evidence, so the potential significance cannot be assessed.

major comments (2)

[Abstract] The central empirical claim (performance gains on TTS/STT plus artifact reduction) is stated without any metrics, baselines, ablation tables, evaluation protocol, or quantitative results, making the claim impossible to evaluate.
[Abstract] No equations, model architecture, discretization procedure, or joint objective are defined, so the technical mechanism underlying the claimed improvements cannot be examined.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their feedback. We agree that the abstract would benefit from additional specificity to better support evaluation of the claims. We address the two major comments below and will revise the abstract in the resubmission.

Referee: [Abstract] The central empirical claim (performance gains on TTS/STT plus artifact reduction) is stated without any metrics, baselines, ablation tables, evaluation protocol, or quantitative results, making the claim impossible to evaluate.

Authors: We acknowledge that the provided abstract states the performance claims at a high level without quantitative support. The full manuscript contains the requested details (metrics, baselines, ablation studies, and evaluation protocols) in the Experiments section. To address the concern directly, we will revise the abstract to incorporate key quantitative results and reference the evaluation protocol.
Revision: yes

Referee: [Abstract] No equations, model architecture, discretization procedure, or joint objective are defined, so the technical mechanism underlying the claimed improvements cannot be examined.

Authors: The abstract is intentionally concise and high-level. The model architecture, discretization procedure for the latent variables, and the joint optimization objective (with equations) are defined in Section 3 of the manuscript. We will revise the abstract to include a brief description of the joint encoder-LM optimization mechanism.
Revision: yes

Circularity Check

0 steps flagged

No significant circularity detected

The abstract and available text describe an empirical model for joint optimization of an encoder and autoregressive speech LM on discrete mel latents, with performance claims on TTS/STT tasks. No equations, first-principles derivations, fitted parameters renamed as predictions, or self-citation chains appear. The central claims rest on external empirical benchmarks rather than reducing to inputs by construction, making the work self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract is available; no technical sections, equations, or implementation details are provided from which free parameters, axioms, or invented entities can be extracted.

pith-pipeline@v0.9.1-grok · 5661 in / 1144 out tokens · 29099 ms · 2026-06-29T05:33:06.120402+00:00 · methodology

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

LLM can Read Spectrogram: Encoder-free Speech-Language Modeling
eess.AS · 2026-06 · unverdicted · novelty 6.0

Mel-LLM shows that LLMs can process Mel spectrograms directly for competitive ASR performance without a dedicated speech encoder, with limited degradation versus encoder-based versions when using multimodal initializa...

Reference graph

Works this paper leans on

11 extracted references · 8 canonical work pages · cited by 1 Pith paper · 3 internal anchors

[1] Siddhant Arora, Kai-Wei Chang, Chung-Ming Chien, Yifan Peng, Haibin Wu, Yossi Adi, Emmanuel Dupoux, Hung-Yi Lee, Karen Livescu, and Shinji Watanabe. On the landscape of spoken language models: A comprehensive survey. arXiv preprint arXiv:2504.08528.

[2] Richard He Bai, Tatiana Likhomanenko, Ruixiang Zhang, Zijin Gu, Zakaria Aldeneh, and Navdeep Jaitly. dMel: Speech tokenization made simple. arXiv preprint arXiv:2407.15835.

[3] Sanyuan Chen, Shujie Liu, Long Zhou, Yanqing Liu, Xu Tan, Jinyu Li, Sheng Zhao, Yao Qian, and Furu Wei. VALL-E 2: Neural codec language models are human parity zero-shot text to speech synthesizers. arXiv preprint arXiv:2406.05370.

[4] Alexandre Défossez, Laurent Mazaré, Manu Orsini, Amélie Royer, Patrick Pérez, Hervé Jégou, Edouard Grave, and Neil Zeghidour. Moshi: a speech-text foundation model for real-time dialogue. arXiv preprint arXiv:2410.00037.

[5] Yutao Sun, Hangbo Bao, Wenhui Wang, Zhiliang Peng, Li Dong, Shaohan Huang, Jianyong Wang, and Furu Wei. Multimodal latent language modeling with next-token diffusion. arXiv preprint arXiv:2412.08635.

[6] Arnon Turetzky, Nimrod Shabtay, Slava Shechtman, Hagai Aronowitz, David Haws, Ron Hoory, and Avihu Dekel. Continuous speech synthesis using per-token latent diffusion. arXiv preprint arXiv:2410.16048.

[7] Chengyi Wang, Sanyuan Chen, Yu Wu, Ziqiang Zhang, Long Zhou, Shujie Liu, Zhuo Chen, Yanqing Liu, Huaming Wang, Jinyu Li, et al. Neural codec language models are zero-shot text to speech synthesizers. arXiv preprint arXiv:2301.02111.

[8] Yixuan Zhou, Guoyang Zeng, Xin Liu, Xiang Li, Renjie Yu, Ziyang Wang, Runchuan Ye, Weiyue Sun, Jiancheng Gui, Kehan Li, et al. VoxCPM: Tokenizer-free TTS for context-aware speech generation and true-to-life voice cloning. arXiv preprint arXiv:2509.24650, 2025.

[9] The learning rate is linearly warmed up for 1k steps to 5×10⁻⁴, held constant for 100k steps, and then linearly decayed over the final 100k steps. We also experiment with longer training steps, and we do not see notable improvements. A.3 Mel-spectrograms: In line with the mel-spectrogram settings from the pre-trained HiFi-GAN (Kong et al., 2020), we extract...

[10] ...consists of four terms: a regression loss L_reg, a KL divergence loss L_KL, a stop prediction loss L_stop, and a spectrogram flux loss L_flux. The losses are combined with different weights. The regression loss is related to the MSE loss in our objective (8) for reconstruction but with extra L1 terms. Their reconstruction loss only depends on z_t, while we have additional dep...

[11] ...to refine predicted mel-spectrograms, though the exact architectures of the convolutional layers are not presented in Meng et al. (2025). We also experiment with extending MELLE training for up to 400k steps, while the proposed approach and Codec-LM converge in 100k steps. We suspect the VAD step is critical in the original MELLE to prevent models from k...
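
The learning-rate schedule quoted in entry [9] can be sketched as a small function. The three phase lengths and the peak value come from the excerpt; the function name, the assumption that warmup starts from zero, the assumption that the decay ends at zero, and the reading that the three phases run back to back (201k steps in total) are hypothetical choices made here.

```python
def learning_rate(step, peak=5e-4, warmup=1_000, hold=100_000, decay=100_000, floor=0.0):
    """Warmup, hold, then linear decay, as described in the quoted excerpt.

    Assumptions not stated in the excerpt: warmup starts at 0, decay ends at
    `floor` (0.0 by default), and the phases follow each other without overlap.
    """
    if step < warmup:
        # Linear warmup from 0 to the peak learning rate.
        return peak * step / warmup
    if step < warmup + hold:
        # Constant phase at the peak learning rate.
        return peak
    decay_step = step - warmup - hold
    if decay_step >= decay:
        # Past the end of the schedule: stay at the floor.
        return floor
    # Linear interpolation from peak down to floor.
    return peak + (floor - peak) * (decay_step / decay)


for s in (0, 500, 1_000, 101_000, 151_000, 201_000):
    print(s, learning_rate(s))
```

If the original schedule decays to a nonzero value or counts warmup inside the constant phase, only the `floor` and `hold` arguments would need to change.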
