pith. machine review for the scientific record.

arxiv: 2604.08562 · v1 · submitted 2026-03-17 · 💻 cs.CL · cs.AI · cs.SD · eess.AS

Recognition: no theorem link

Neural networks for Text-to-Speech evaluation

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 09:43 UTC · model grok-4.3

classification 💻 cs.CL · cs.AI · cs.SD · eess.AS
keywords text-to-speech evaluation · MOS prediction · neural networks · speech quality assessment · SOMOS dataset · WhisperBert · NeuralSBS

The pith

Neural models predict TTS quality with RMSE 0.40, beating the 0.62 human inter-rater baseline.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper develops specialized neural networks to stand in for human listeners when judging text-to-speech output quality. The models handle both absolute scoring via Mean Opinion Score and relative side-by-side comparisons, trained and tested primarily on the SOMOS dataset. The strongest results come from an enhanced MOSNet and a new WhisperBert stacking ensemble that reaches lower error than the typical disagreement between human raters. Ablation experiments show that simple cross-attention fusion of audio and text features hurts performance, while dedicated metric-learning approaches succeed where zero-shot large language models do not.
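
The relative (SBS) half of that setup is easiest to picture as a Siamese comparator. Below is a minimal PyTorch sketch of the pattern, assuming a shared pretrained speech encoder standing in for HuBERT; the pooling, head sizes, and class name are illustrative, not the authors' exact NeuralSBS design.

```python
import torch
import torch.nn as nn

class PairwiseComparator(nn.Module):
    """SBS-style preference model: one shared encoder scores two clips.

    `encoder` is any module mapping a waveform batch to frame embeddings
    of shape (B, T', dim); the paper uses HuBERT in this role.
    """

    def __init__(self, encoder: nn.Module, dim: int = 768):
        super().__init__()
        self.encoder = encoder                 # weights shared across clips A and B
        self.head = nn.Sequential(
            nn.Linear(2 * dim, 256),
            nn.ReLU(),
            nn.Linear(256, 1),                 # logit for P(clip A preferred)
        )

    def embed(self, wav: torch.Tensor) -> torch.Tensor:
        return self.encoder(wav).mean(dim=1)   # mean-pool frames to one vector

    def forward(self, wav_a: torch.Tensor, wav_b: torch.Tensor) -> torch.Tensor:
        z = torch.cat([self.embed(wav_a), self.embed(wav_b)], dim=-1)
        return self.head(z).squeeze(-1)        # train with nn.BCEWithLogitsLoss
```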

Core claim

Dedicated neural architectures can approximate expert human judgments of TTS audio more consistently than humans agree with one another. On the SOMOS dataset the best MOS predictors achieve an RMSE of approximately 0.40 while the human inter-rater RMSE baseline is 0.62; NeuralSBS reaches 73.7 percent accuracy on relative comparisons. Ensemble stacking of Whisper audio features with BERT text embeddings outperforms direct latent fusion, and negative results are reported for SpeechLM-based models and zero-shot LLM evaluators.
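
Those two numbers measure different things: the model's error against the per-utterance mean MOS, and how far individual raters sit from the consensus of their peers. A numpy sketch of both quantities, assuming a dense rating matrix and a leave-one-rater-out definition of the baseline (the paper's exact protocol may differ):

```python
import numpy as np

def rmse(pred: np.ndarray, target: np.ndarray) -> float:
    return float(np.sqrt(np.mean((pred - target) ** 2)))

def model_vs_rater_rmse(model_pred: np.ndarray, ratings: np.ndarray):
    """model_pred: (n_utterances,) predicted MOS.
    ratings: (n_utterances, n_raters) human MOS labels."""
    mean_mos = ratings.mean(axis=1)
    model_rmse = rmse(model_pred, mean_mos)
    # Leave-one-rater-out: score each rater against the others' mean.
    rater_rmses = [
        rmse(ratings[:, j], np.delete(ratings, j, axis=1).mean(axis=1))
        for j in range(ratings.shape[1])
    ]
    return model_rmse, float(np.mean(rater_rmses))
```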

What carries the argument

WhisperBert, a multimodal stacking ensemble that combines Whisper audio features and BERT textual embeddings through weak learners, together with custom sequence-length batching applied to MOSNet.
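
The stacking recipe is not fully specified here, so the scikit-learn sketch below is one plausible reading: concatenate pooled Whisper and BERT embeddings, fit a few base regressors, and train a meta-regressor on their out-of-fold predictions. The particular base learners are assumptions, not the paper's choices.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, StackingRegressor
from sklearn.linear_model import Ridge
from sklearn.svm import SVR

def fit_stacked_mos_predictor(X_audio: np.ndarray, X_text: np.ndarray, y: np.ndarray):
    """X_audio: (n, d_a) pooled Whisper features; X_text: (n, d_t) BERT
    embeddings; y: (n,) mean MOS labels. Feature extraction is omitted."""
    X = np.concatenate([X_audio, X_text], axis=1)    # late feature concatenation
    stack = StackingRegressor(
        estimators=[                                  # "weak learners" (assumed)
            ("ridge", Ridge(alpha=1.0)),
            ("svr", SVR(C=1.0)),
            ("gbr", GradientBoostingRegressor()),
        ],
        final_estimator=Ridge(alpha=1.0),             # meta-learner on OOF predictions
        cv=5,                                         # out-of-fold stacking
    )
    return stack.fit(X, y)
```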

If this is right

  • Automated evaluation at scale becomes practical for iterative TTS development without repeated human listening panels.
  • Ensemble stacking is more effective than naive cross-attention fusion when combining audio and text modalities (the losing fusion pattern is sketched just after this list).
  • Zero-shot LLM evaluators and certain SpeechLM architectures fail to match the accuracy of models trained specifically for metric prediction.
  • Ablation results confirm that careful architectural choices, rather than larger backbones alone, drive the reported gains on the SOMOS benchmark.
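
For contrast with the second bullet, here is a minimal PyTorch sketch of the kind of naive cross-attention fusion the ablations penalize: audio frames attend to text tokens before a single regression head. Dimensions and pooling are assumed, not the paper's exact ablation configuration.

```python
import torch
import torch.nn as nn

class CrossAttentionFusion(nn.Module):
    """Direct latent fusion: audio queries attend to text keys/values."""

    def __init__(self, d_audio: int = 768, d_text: int = 768, n_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(
            d_audio, n_heads, kdim=d_text, vdim=d_text, batch_first=True
        )
        self.regressor = nn.Linear(d_audio, 1)         # scalar MOS estimate

    def forward(self, audio_feats: torch.Tensor, text_feats: torch.Tensor):
        # audio_feats: (B, T_a, d_audio), e.g. Whisper frames
        # text_feats:  (B, T_t, d_text),  e.g. BERT token embeddings
        fused, _ = self.attn(audio_feats, text_feats, text_feats)
        return self.regressor(fused.mean(dim=1)).squeeze(-1)
```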

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • These predictors could be inserted directly into TTS training loops to optimize perceptual quality without external human feedback.
  • Performance is likely to degrade on languages or recording conditions absent from the training distribution, pointing to the need for more diverse corpora.
  • The same stacking approach may transfer to evaluating other generative audio tasks such as music synthesis or environmental sound generation.

Load-bearing premise

Models trained on existing datasets such as SOMOS will maintain performance on new TTS architectures, languages, or acoustic conditions not seen during training.

What would settle it

Testing the trained models on TTS samples drawn from a different language family or from synthesis systems that use unseen vocoders or acoustic conditions would show whether RMSE stays near 0.40 or rises toward the human baseline.
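
Mechanically the test is a short loop. A sketch under assumed names (predict_mos and the domain dictionary are hypothetical stand-ins for the trained model and the held-out corpora):

```python
import numpy as np

HUMAN_BASELINE_RMSE = 0.62   # inter-rater baseline reported on SOMOS
IN_DOMAIN_RMSE = 0.40        # best reported model RMSE on SOMOS

def ood_stress_test(predict_mos, domains):
    """domains: {name: (waveforms, mean_mos_labels)} drawn from unseen
    languages, vocoders, or recording conditions."""
    for name, (wavs, labels) in domains.items():
        preds = np.array([predict_mos(w) for w in wavs])
        err = float(np.sqrt(np.mean((preds - np.asarray(labels)) ** 2)))
        verdict = "holds" if err < HUMAN_BASELINE_RMSE else "degrades past the human baseline"
        print(f"{name}: RMSE {err:.2f} ({verdict}; in-domain was {IN_DOMAIN_RMSE})")
```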

Figures

Figures reproduced from arXiv: 2604.08562 by Aleksandr Zaitsev, David Kocharyan, Ilya Trofimenko, Mark Levin, Nikita Shevtsov, Pavel Repnikov.

Figure 1. The NeuralSBS architecture. Audio clips A and B are encoded by a shared HuBERT model, followed by …
Figure 2. The NeuralSBSBert architecture. It extends NeuralSBS by incorporating textual features from BERT via a …
Figure 3. The enhanced MOSNet architecture: audio features pass through 1D-CNNs and a BLSTM before global average pooling. Padding is explicitly masked during the loss calculation.
Figure 5. The WhisperBert multimodal stacking architecture. Independent weak learners process concatenated audio …
read the original abstract

Ensuring that Text-to-Speech (TTS) systems deliver human-perceived quality at scale is a central challenge for modern speech technologies. Human subjective evaluation protocols such as Mean Opinion Score (MOS) and Side-by-Side (SBS) comparisons remain the de facto gold standards, yet they are expensive, slow, and sensitive to pervasive assessor biases. This study addresses these barriers by formulating and implementing a suite of novel neural models designed to approximate expert judgments in both relative (SBS) and absolute (MOS) settings. For relative assessment, we propose NeuralSBS, a HuBERT-backed model achieving 73.7% accuracy (on the SOMOS dataset). For absolute assessment, we introduce enhancements to MOSNet using custom sequence-length batching, as well as WhisperBert, a multimodal stacking ensemble that combines Whisper audio features and BERT textual embeddings via weak learners. Our best MOS models achieve a Root Mean Square Error (RMSE) of ~0.40, significantly outperforming the human inter-rater RMSE baseline of 0.62. Furthermore, our ablation studies reveal that naively fusing text via cross-attention can degrade performance, highlighting the effectiveness of ensemble-based stacking over direct latent fusion. We additionally report negative results with SpeechLM-based architectures and zero-shot LLM evaluators (Qwen2-Audio, Gemini 2.5 flash preview), reinforcing the necessity of dedicated metric learning frameworks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces NeuralSBS (HuBERT-based) for relative TTS side-by-side evaluation (73.7% accuracy on SOMOS) and two absolute MOS predictors: an enhanced MOSNet with custom sequence-length batching plus WhisperBert, a multimodal stacking ensemble of Whisper audio features and BERT text embeddings. On SOMOS the best MOS models report RMSE ~0.40, beating the human inter-rater baseline of 0.62; ablations show that naive cross-attention text fusion hurts performance while stacking helps; negative results are reported for SpeechLM architectures and zero-shot LLMs (Qwen2-Audio, Gemini).

Significance. If the reported RMSE improvement and ensemble advantage hold under proper controls, the work would supply a concrete, trainable alternative to costly human MOS/SBS protocols and would usefully document the failure of current LLM-based zero-shot evaluators. The negative results on LLMs and the ablation on fusion strategies are informative contributions. However, because all quantitative claims rest on a single dataset (SOMOS) without out-of-distribution testing, the practical significance for new TTS architectures, languages, or acoustic conditions remains unestablished.

major comments (2)
  1. [abstract and experimental sections] The central claim that the models achieve RMSE ~0.40 and thereby outperform the human inter-rater baseline of 0.62 is demonstrated exclusively on the SOMOS dataset. No results are reported on TTS systems, languages, or acoustic conditions absent from the training distribution, which directly undermines the asserted practical utility of NeuralSBS / WhisperBert / enhanced MOSNet as drop-in evaluators (see abstract and §4–5).
  2. [§4 (methods) and §5 (experiments)] Data partitioning, statistical significance testing, and full training protocols (including the custom sequence-length batching hyperparameters) receive only limited description. Without explicit train/validation/test splits, cross-validation details, or confidence intervals on the reported RMSE and accuracy figures, it is impossible to assess whether the 0.40 RMSE reflects genuine generalization or overfitting to SOMOS idiosyncrasies.
minor comments (2)
  1. [§3.2] Notation for the WhisperBert stacking ensemble (weak learners, fusion weights) should be defined more explicitly, ideally with a diagram or pseudocode.
  2. [abstract and §5] The abstract states “significantly outperforming” the human baseline; the corresponding statistical test (paired t-test, bootstrap, etc.) should be reported in the main text.
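
The significance evidence the report asks for is straightforward to specify. One standard recipe is a paired bootstrap over utterances; a minimal numpy sketch (the authors may well prefer a different test):

```python
import numpy as np

def paired_bootstrap_rmse_diff(pred_model, pred_baseline, target,
                               n_boot: int = 10_000, seed: int = 0):
    """95% CI for RMSE(model) - RMSE(baseline), resampling utterances.
    All three inputs are (n,) numpy arrays over the same test set."""
    rng = np.random.default_rng(seed)
    se_m = (pred_model - target) ** 2
    se_b = (pred_baseline - target) ** 2
    n = len(target)
    diffs = np.empty(n_boot)
    for i in range(n_boot):
        idx = rng.integers(0, n, size=n)       # resample with replacement
        diffs[i] = np.sqrt(se_m[idx].mean()) - np.sqrt(se_b[idx].mean())
    lo, hi = np.percentile(diffs, [2.5, 97.5])
    return float(lo), float(hi)                # CI excluding 0 => significant
```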

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. The comments correctly identify that all results are reported on SOMOS and that methodological details require expansion. We address each point below and will revise the manuscript to improve clarity and transparency.

read point-by-point responses
  1. Referee: [abstract and experimental sections] The central claim that the models achieve RMSE ~0.40 and thereby outperform the human inter-rater baseline of 0.62 is demonstrated exclusively on the SOMOS dataset. No results are reported on TTS systems, languages, or acoustic conditions absent from the training distribution, which directly undermines the asserted practical utility of NeuralSBS / WhisperBert / enhanced MOSNet as drop-in evaluators (see abstract and §4–5).

    Authors: We agree that the quantitative results are obtained exclusively on the SOMOS dataset and that this constrains strong claims of broad practical utility for arbitrary new TTS systems or acoustic conditions. The work is framed as an evaluation on the established SOMOS benchmark, following conventions in prior TTS metric papers. In the revision we will (i) rephrase the abstract to avoid implying immediate drop-in replacement across all settings and (ii) add an explicit limitations subsection that states the current scope and calls for future out-of-distribution validation. These textual changes will accurately reflect the manuscript’s contribution without overstating generalizability. revision: yes

  2. Referee: [§4 (methods) and §5 (experiments)] Data partitioning, statistical significance testing, and full training protocols (including the custom sequence-length batching hyperparameters) receive only limited description. Without explicit train/validation/test splits, cross-validation details, or confidence intervals on the reported RMSE and accuracy figures, it is impossible to assess whether the 0.40 RMSE reflects genuine generalization or overfitting to SOMOS idiosyncrasies.

    Authors: We accept that the current description of data handling and training details is insufficient. The revised manuscript will include: the precise train/validation/test split ratios and any random seeds; a statement on whether cross-validation was performed; full specification of the sequence-length batching hyperparameters and training schedule; and confidence intervals (or standard deviations across runs) for all reported RMSE and accuracy figures. Where appropriate, we will also add statistical significance tests comparing the proposed models against baselines. revision: yes
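
For concreteness, "sequence-length batching" with a padding-masked loss can be as simple as the sketch below; the bucketing scheme and masking are assumptions consistent with the masking described in Figure 3, not the authors' disclosed hyperparameters.

```python
import torch

def length_bucketed_batches(lengths, batch_size):
    """Group utterances of similar length to cut padding waste.
    A real sampler would also shuffle buckets between epochs."""
    order = sorted(range(len(lengths)), key=lambda i: lengths[i])
    for start in range(0, len(order), batch_size):
        yield order[start:start + batch_size]

def masked_mse(pred: torch.Tensor, target: torch.Tensor, mask: torch.Tensor):
    """MSE over real frames only; mask is (B, T) with 1.0 for real
    frames and 0.0 for padding, so padded positions never enter the loss."""
    diff2 = (pred - target) ** 2 * mask
    return diff2.sum() / mask.sum().clamp_min(1.0)
```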

Circularity Check

0 steps flagged

No significant circularity; standard supervised evaluation on external data

full rationale

The paper trains models (NeuralSBS, WhisperBert, enhanced MOSNet) via supervised learning on external labeled datasets such as SOMOS and reports test-set metrics (accuracy 73.7%, RMSE ~0.40). The comparison to the human inter-rater RMSE baseline of 0.62 is a direct empirical evaluation on held-out data, not a reduction of outputs to fitted parameters by construction. No equations, self-citations, or ansatzes are shown that force the central claims to equal their inputs. The claims are grounded in external benchmarks rather than derived from the models' own assumptions.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The work rests on standard transfer-learning assumptions from pretrained audio and text models plus supervised training on existing human-labeled datasets; no new physical entities or ad-hoc constants are introduced.

free parameters (1)
  • custom sequence-length batching hyperparameters
    Tuned parameters for handling variable audio lengths in the MOSNet enhancement
axioms (1)
  • domain assumption: Pretrained HuBERT, Whisper, and BERT models extract features sufficient for approximating human TTS quality judgments
    Relies on transfer from these models without independent verification of feature completeness for the target task

pith-pipeline@v0.9.0 · 5574 in / 1169 out tokens · 32579 ms · 2026-05-15T09:43:57.123500+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

11 extracted references · 11 canonical work pages

  1. [1]

    UTMOS: UTokyo-SaruLab system for VoiceMOS Challenge 2022,

    T. Saeki, D. Xin, W. Nakata, T. Koriyama, S. Takamichi, and H. Saruwatari, “UTMOS: UTokyo-SaruLab system for VoiceMOS Challenge 2022,” in Proc. Interspeech, 2022, pp. 4546–4550

  2. [2]

    VoiceMOS Challenge 2024,

    VoiceMOS Challenge, “VoiceMOS Challenge 2024,” https://sites.google.com/view/voicemos-challenge, 2024

  3. [3]

    HuBERT: Self-supervised speech representation learning by masked prediction of hidden units,

    W.-N. Hsu, B. Bolte, Y.-H. H. Tsai, K. Lakhotia, R. Salakhutdinov, and A. Mohamed, “HuBERT: Self-supervised speech representation learning by masked prediction of hidden units,” arXiv preprint arXiv:2106.07447, 2021

  4. [4]

    MOSNet: Deep learning based objective assessment for voice conversion,

    C.-C. Lo et al., “MOSNet: Deep learning based objective assessment for voice conversion,” arXiv preprint arXiv:1904.08352, 2019

  5. [5]

    SOMOS: The Samsung open MOS dataset for the prediction of synthesized speech quality,

    G. Chessex et al., “SOMOS: The Samsung open MOS dataset for the prediction of synthesized speech quality,” in Proc. Interspeech, 2022

  6. [6]

    DNSMOS: A non-intrusive perceptual objective speech quality metric to evaluate noise suppressors,

    C. K. A. Reddy et al., “DNSMOS: A non-intrusive perceptual objective speech quality metric to evaluate noise suppressors,” in Proc. ICASSP, 2021, pp. 6493–6497

  7. [7]

    DistilMOS: Layer-wise self-distillation for self-supervised learning model-based MOS prediction,

    J. Yang, W. Nakata, Y. Saito, and H. Saruwatari, “DistilMOS: Layer-wise self-distillation for self-supervised learning model-based MOS prediction,” arXiv preprint arXiv:2601.13700, 2026

  8. [8]

    NISQA: A deep CNN-self-attention model for multidimensional speech quality prediction,

    G. Mittag and S. Möller, “NISQA: A deep CNN-self-attention model for multidimensional speech quality prediction,” Proc. Interspeech, pp. 2127–2131, 2021

  9. [9]

    Neural side-by-side: Predicting human preferences for no-reference super-resolution evaluation,

    V. Khrulkov and A. Babenko, “Neural side-by-side: Predicting human preferences for no-reference super-resolution evaluation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2021

  10. [10]

    The unreasonable effectiveness of deep features as a perceptual metric,

    R. Zhang et al., “The unreasonable effectiveness of deep features as a perceptual metric,” in Proc. CVPR, 2018, pp. 586–595

  11. [11]

    SpeechGPT: Empowering large language models with intrinsic cross-modal conversational abilities,

    Y. Zhang et al., “SpeechGPT: Empowering large language models with intrinsic cross-modal conversational abilities,” arXiv preprint arXiv:2305.11000, 2023