Why We Need Speech to Evaluate Speech Translation

Danni Liu; Jan Niehues; Maike Z\"ufle; Vil\'em Zouhar

arxiv: 2605.28227 · v1 · pith:24UUIQ3Bnew · submitted 2026-05-27 · 💻 cs.CL

Why We Need Speech to Evaluate Speech Translation

Maike Z\"ufle , Danni Liu , Vil\'em Zouhar , Jan Niehues This is my paper

Pith reviewed 2026-06-29 13:30 UTC · model grok-4.3

classification 💻 cs.CL

keywords speech translationquality estimationspeech-specific phenomenagender agreementprosodyevaluation metricsCOMET

0 comments

The pith

Current metrics for speech translation fail to detect whether gender agreement and prosody are preserved, even when given the speech signal directly.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Speech translation models increasingly keep speaker gender, prosody, and emphasis in their outputs, yet quality estimation metrics cannot tell whether those features survive. The authors create contrastive test sets that differ only in gender agreement or prosody and show that both text-based and speech-based metrics perform no better than chance at preferring the version that matches the source speech. New speech-encoder models and a SpeechLLM judge reach or exceed ordinary COMET scores on standard benchmarks but still fail on the speech-specific tests. The paper attributes the gap to encoders that do not reliably carry the features, models that largely ignore the speech input, and training data that contains too few relevant examples. It concludes that dedicated speech-specific training data and models that actually condition on speech are required.

Core claim

Both text- and speech-based quality estimation metrics fall short at assessing preservation of gender agreement and prosody in speech translation, even when given direct access to the speech signal. SpeechCOMET models with speech encoders and a state-of-the-art SpeechLLM judge match or exceed text-based COMET on standard quality estimation yet do not consistently evaluate speech-specific phenomena. The shortcomings arise because speech-specific features are not reliably preserved in current encoders, models tend to ignore the speech source signal, and quality estimation training data contains too few relevant examples.

What carries the argument

Contrastive datasets targeting gender agreement and prosody mismatches, used to meta-evaluate quality estimation metrics including new SpeechCOMET variants with speech encoders.

If this is right

Standard quality estimation benchmarks are insufficient to measure progress on speech-specific preservation in translation.
Speech encoders must be improved so that gender and prosody features are reliably encoded.
Training data for quality estimation must include more examples that target speech-specific phenomena.
Quality estimation models need architectures that genuinely condition on the speech source rather than defaulting to text.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Similar evaluation gaps are likely to appear in other speech-to-text or speech-to-speech tasks that involve paralinguistic features.
Improving speech conditioning in quality estimation could indirectly raise the bar for speech translation model training itself.
Future work could test whether larger or differently pre-trained speech encoders reduce the observed gaps without new task-specific data.

Load-bearing premise

The contrastive datasets isolate gender agreement and prosody without introducing other differences that could affect metric scores.

What would settle it

A quality estimation model that assigns higher scores to the gender-matching translation than to the mismatched one on most pairs in the gender contrastive set, and likewise for the prosody contrastive set.

Figures

Figures reproduced from arXiv: 2605.28227 by Danni Liu, Jan Niehues, Maike Z\"ufle, Vil\'em Zouhar.

**Figure 2.** Figure 2: Probing accuracy for linguistic features on the IWSLT 2026 Metrics dataset. Scores averaged over 3 [PITH_FULL_IMAGE:figures/full_fig_p006_2.png] view at source ↗

**Figure 3.** Figure 3: Probing accuracy for speaker gender on MuST-SHE and for intonation (question vs. statement) and emotion (angry, happy, neutral, surprised) on ContraProST. Scores averaged over three runs (std. ≤ 0.02). SpeechCOMETWhisper and SpeechLLMs enable accurate recovery of acoustic features while SpeechCOMETSONAR does not. Whisper pooling reinforces the conclusion that the quality estimation objective, in its curren… view at source ↗

**Figure 4.** Figure 4: System and user prompts used for Qwen2.5-Omni-7B QE. The system prompt varies across the three [PITH_FULL_IMAGE:figures/full_fig_p012_4.png] view at source ↗

**Figure 5.** Figure 5: Probing accuracy for linguistic features on the IWSLT 2026 Metrics dataset, grouped by input modality [PITH_FULL_IMAGE:figures/full_fig_p016_5.png] view at source ↗

**Figure 6.** Figure 6: Probing accuracy for acoustic features on MuST-SHE (speaker gender) and ContraProST (intonation, [PITH_FULL_IMAGE:figures/full_fig_p016_6.png] view at source ↗

read the original abstract

Speech translation models are increasingly capable of preserving speech-specific information (e.g., speaker gender, prosody, and emphasis), yet evaluation metrics remain blind to such phenomena. We meta-evaluate both text- and speech-based quality estimation metrics on two contrastive datasets targeting gender agreement and prosody, and find that both fall short, even when given direct access to the speech signal. We then train SpeechCOMET, a family of quality estimation models with speech encoders, and evaluate a state-of-the-art SpeechLLM as a judge. Both match or exceed text-based COMET on standard quality estimation, but neither consistently assesses speech-specific phenomena. We identify three causes: (1) speech-specific features are not reliably preserved in current encoders, (2) models tend to ignore the speech source signal, and (3) quality estimation training data contains too few relevant examples. We release all models and code, and argue that progress requires dedicated speech-specific training data and models that genuinely condition on speech.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Speech QE metrics miss gender and prosody even with audio access, but the contrastive datasets need clearer isolation checks to support the claims.

read the letter

This paper's main point is that quality estimation metrics for speech translation do not pick up on speech-specific details like speaker gender and prosody, even when the metrics receive the speech signal directly. Text-based and speech-based metrics both fall short on the authors' tests.

The new elements are the two contrastive datasets built to target gender agreement and prosody, plus the training of SpeechCOMET models that use speech encoders. They also test a SpeechLLM judge. The new models match or beat text COMET on ordinary QE tasks, yet still fail to track the speech features consistently. The authors name three causes: encoders that do not reliably keep the features, models that ignore the speech source, and training data with too few relevant cases. Releasing the models and code is a concrete step that others can use.

The soft spot is the contrastive dataset construction. If the pairs were made by editing speech, changing text, or synthesis, other differences in fluency, semantics, or acoustic quality could affect the scores. Without details on controls or human checks that the only difference is the target feature, it is hard to attribute the metric failures specifically to speech phenomena rather than something else. The abstract supplies no numbers or error analysis, so the size of the shortfall is also unclear.

The work is aimed at people who build or evaluate speech translation systems and QE models. It flags a gap that matters for applications like multilingual tools. It deserves peer review because the problem is real and the resource release helps, though the methods will need to show how the datasets isolate the intended phenomena.

Referee Report

2 major / 2 minor

Summary. The paper claims that existing text- and speech-based quality estimation (QE) metrics for speech translation fail to assess speech-specific phenomena such as gender agreement and prosody, even when given direct access to the speech signal. It meta-evaluates metrics on two new contrastive datasets targeting these phenomena, trains a family of speech-encoder QE models called SpeechCOMET that match or exceed text COMET on standard QE benchmarks, and tests a state-of-the-art SpeechLLM as a judge; both still fall short on the speech-specific tasks. The authors identify three causes—unreliable preservation of speech features in encoders, tendency to ignore the speech source, and scarcity of relevant examples in QE training data—and release all models and code, arguing for dedicated speech-specific training resources.

Significance. If the meta-evaluation results hold, the work identifies a substantive gap in current evaluation practices for speech translation systems that aim to preserve paralinguistic information. The explicit release of models, code, and (presumably) the contrastive datasets supports reproducibility and follow-on work. The diagnosis of three concrete failure modes provides actionable directions, though the strength of the conclusions depends on the validity of the dataset construction.

major comments (2)

[Dataset construction] Dataset construction section: the central claim that metrics 'fall short' on speech-specific phenomena rests on the assumption that the two contrastive datasets isolate gender agreement and prosody without confounds in fluency, semantics, or acoustic naturalness. No description of pair-construction method (editing, synthesis, or otherwise), human validation of isolation, or ablation confirming other quality dimensions are held constant is provided; without this, score differences cannot be attributed specifically to failure on the targeted features.
[Analysis of failure modes] Section 4 (or equivalent) on causes: the three identified causes—(1) speech features not reliably preserved in encoders, (2) models ignoring the speech source signal, and (3) insufficient relevant examples in training data—are presented as explanations, but the supporting experiments (e.g., ablation on source conditioning or data statistics) are not detailed enough to establish that these are the load-bearing reasons rather than artifacts of the particular contrastive pairs.

minor comments (2)

[Abstract] Abstract and introduction: the claim that 'both fall short, even when given direct access to the speech signal' would be clearer if the exact speech-based metrics and input formats were named at first mention.
[Conclusion] The paper states it releases 'all models and code'; confirm that the contrastive datasets themselves are also released with documentation of their construction pipeline.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive report. The two major comments highlight important areas where additional clarity and evidence are needed to support our claims about the contrastive datasets and the identified failure modes. We address each point below and will revise the manuscript to incorporate the requested details and expanded analyses.

read point-by-point responses

Referee: [Dataset construction] Dataset construction section: the central claim that metrics 'fall short' on speech-specific phenomena rests on the assumption that the two contrastive datasets isolate gender agreement and prosody without confounds in fluency, semantics, or acoustic naturalness. No description of pair-construction method (editing, synthesis, or otherwise), human validation of isolation, or ablation confirming other quality dimensions are held constant is provided; without this, score differences cannot be attributed specifically to failure on the targeted features.

Authors: We agree that the manuscript currently lacks sufficient detail on dataset construction to fully substantiate the isolation of the targeted phenomena. The provided overview does not include explicit descriptions of pair-construction methods, human validation protocols, or ablations verifying that fluency, semantics, and acoustic naturalness remain constant. In the revised manuscript, we will expand the relevant section to describe the construction process (including any use of editing or synthesis), report the results of human validation studies confirming feature isolation, and include ablations or controls demonstrating that other quality dimensions are held constant. This will allow the score differences to be more confidently attributed to the speech-specific features. revision: yes
Referee: [Analysis of failure modes] Section 4 (or equivalent) on causes: the three identified causes—(1) speech features not reliably preserved in encoders, (2) models ignoring the speech source signal, and (3) insufficient relevant examples in training data—are presented as explanations, but the supporting experiments (e.g., ablation on source conditioning or data statistics) are not detailed enough to establish that these are the load-bearing reasons rather than artifacts of the particular contrastive pairs.

Authors: We acknowledge that the current presentation of the three failure modes relies on supporting experiments that may not be detailed enough to establish them as the primary causes independent of the specific contrastive pairs. The manuscript includes initial evidence from source-conditioning ablations and training-data statistics, but these require expansion to rule out potential artifacts. In the revision, we will add more detailed experiments in the relevant section, including additional quantitative ablations on source signal usage, probing results for feature preservation in encoders, and expanded statistics on the scarcity of relevant examples in QE training data. We will qualify the conclusions if the expanded analyses do not fully support the three causes as load-bearing. revision: yes

Circularity Check

0 steps flagged

No significant circularity; empirical meta-evaluation is self-contained

full rationale

The paper conducts meta-evaluation of quality estimation metrics on contrastive datasets for gender agreement and prosody, trains SpeechCOMET models, and evaluates a SpeechLLM judge. No equations, derivations, fitted parameters renamed as predictions, or self-citation chains appear in the provided text. Claims rest on experimental results and dataset construction rather than reducing by construction to inputs or prior self-authored uniqueness results. The central findings are falsifiable via the released models and code.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 1 invented entities

The central claim depends on the assumption that the contrastive datasets validly isolate gender agreement and prosody. SpeechCOMET is introduced as a new model family without independent evidence of its superiority on speech features.

axioms (1)

domain assumption The contrastive datasets targeting gender agreement and prosody isolate speech-specific phenomena without confounding variables that would affect metric scores.
The paper's conclusion that metrics fall short rests directly on these datasets being effective probes.

invented entities (1)

SpeechCOMET no independent evidence
purpose: Family of quality estimation models using speech encoders to evaluate speech translation.
Newly trained models introduced to test whether speech input improves assessment of speech-specific features.

pith-pipeline@v0.9.1-grok · 5705 in / 1336 out tokens · 48542 ms · 2026-06-29T13:30:43.582567+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

6 extracted references · 5 canonical work pages · 3 internal anchors

[1]

How to Evaluate Speech Translation with Source-Aware Neural MT Metrics

Gender in danger? evaluating speech transla- tion technology on the MuST-SHE corpus. InPro- ceedings of the 58th Annual Meeting of the Asso- ciation for Computational Linguistics, pages 6923– 6933, Online. Association for Computational Lin- guistics. 9 Mauro Cettolo, Marco Gaido, Matteo Negri, Sara Papi, and Luisa Bentivogli. 2025. How to evaluate speech ...

work page internal anchor Pith review Pith/arXiv arXiv 2025
[2]

Voice, Bias, and Coreference: An Interpretability Study of Gender in Speech Translation

V oice, bias, and coreference: An interpretabil- ity study of gender in speech translation.Preprint, arXiv:2511.21517. David Dale and Marta R. Costa-jussà. 2024. BLASER 2.0: a metric for evaluation and quality estimation of massively multilingual speech and text translation. InFindings of the Association for Computational Linguistics: EMNLP 2024, pages 16...

work page internal anchor Pith review Pith/arXiv arXiv 2024
[3]

Hearing to Translate: The Effectiveness of Speech Modality Integration into LLMs

Hearing to translate: The effectiveness of speech modality integration into llms.Preprint, arXiv:2512.16378. Alec Radford, Jong Wook Kim, Tao Xu, Greg Brock- man, Christine Mcleavey, and Ilya Sutskever. 2023. Robust speech recognition via large-scale weak su- pervision. InProceedings of the 40th International Conference on Machine Learning, volume 202 of ...

work page internal anchor Pith review Pith/arXiv arXiv 2023
[4]

InProceedings of the Seventh Conference on Machine Translation (WMT), pages 578–585, Abu Dhabi, United Arab Emirates (Hybrid)

COMET-22: Unbabel-IST 2022 submission for the metrics shared task. InProceedings of the Seventh Conference on Machine Translation (WMT), pages 578–585, Abu Dhabi, United Arab Emirates (Hybrid). Association for Computational Linguistics. Haoqin Sun, Xuechen Wang, Jinghua Zhao, Shiwan Zhao, Jiaming Zhou, Hui Wang, Jiabei He, Aobo Kong, Xi Yang, Yequan Wang,...

work page arXiv 2022
[5]

InProceedings of the 62nd Annual Meeting of the Association for Compu- tational Linguistics (Volume 3: System Demonstra- tions), Bangkok, Thailand

Llamafactory: Unified efficient fine-tuning of 100+ language models. InProceedings of the 62nd Annual Meeting of the Association for Compu- tational Linguistics (Volume 3: System Demonstra- tions), Bangkok, Thailand. Association for Computa- tional Linguistics. Vilém Zouhar, Shehzaad Dhuliawala, Wangchunshu Zhou, Nico Daheim, Tom Kocmi, Yuchen Eleanor Jia...

work page arXiv 2023
[6]

prosodic breaks

with LoRA adapters (Hu et al., 2021). Fine- tuning and inference are performed on a single NVIDIA A100-SXM4-40GB GPU. Fine-tuning takes around 13 hours. Detailed hyperparameters are listed in Table 5. 11 Standard QE System:You are an evaluator. Given the source and/or audio and a translation, respond with only a single float score between 0 and 1 indicati...

2021

[1] [1]

How to Evaluate Speech Translation with Source-Aware Neural MT Metrics

Gender in danger? evaluating speech transla- tion technology on the MuST-SHE corpus. InPro- ceedings of the 58th Annual Meeting of the Asso- ciation for Computational Linguistics, pages 6923– 6933, Online. Association for Computational Lin- guistics. 9 Mauro Cettolo, Marco Gaido, Matteo Negri, Sara Papi, and Luisa Bentivogli. 2025. How to evaluate speech ...

work page internal anchor Pith review Pith/arXiv arXiv 2025

[2] [2]

Voice, Bias, and Coreference: An Interpretability Study of Gender in Speech Translation

V oice, bias, and coreference: An interpretabil- ity study of gender in speech translation.Preprint, arXiv:2511.21517. David Dale and Marta R. Costa-jussà. 2024. BLASER 2.0: a metric for evaluation and quality estimation of massively multilingual speech and text translation. InFindings of the Association for Computational Linguistics: EMNLP 2024, pages 16...

work page internal anchor Pith review Pith/arXiv arXiv 2024

[3] [3]

Hearing to Translate: The Effectiveness of Speech Modality Integration into LLMs

Hearing to translate: The effectiveness of speech modality integration into llms.Preprint, arXiv:2512.16378. Alec Radford, Jong Wook Kim, Tao Xu, Greg Brock- man, Christine Mcleavey, and Ilya Sutskever. 2023. Robust speech recognition via large-scale weak su- pervision. InProceedings of the 40th International Conference on Machine Learning, volume 202 of ...

work page internal anchor Pith review Pith/arXiv arXiv 2023

[4] [4]

InProceedings of the Seventh Conference on Machine Translation (WMT), pages 578–585, Abu Dhabi, United Arab Emirates (Hybrid)

COMET-22: Unbabel-IST 2022 submission for the metrics shared task. InProceedings of the Seventh Conference on Machine Translation (WMT), pages 578–585, Abu Dhabi, United Arab Emirates (Hybrid). Association for Computational Linguistics. Haoqin Sun, Xuechen Wang, Jinghua Zhao, Shiwan Zhao, Jiaming Zhou, Hui Wang, Jiabei He, Aobo Kong, Xi Yang, Yequan Wang,...

work page arXiv 2022

[5] [5]

InProceedings of the 62nd Annual Meeting of the Association for Compu- tational Linguistics (Volume 3: System Demonstra- tions), Bangkok, Thailand

Llamafactory: Unified efficient fine-tuning of 100+ language models. InProceedings of the 62nd Annual Meeting of the Association for Compu- tational Linguistics (Volume 3: System Demonstra- tions), Bangkok, Thailand. Association for Computa- tional Linguistics. Vilém Zouhar, Shehzaad Dhuliawala, Wangchunshu Zhou, Nico Daheim, Tom Kocmi, Yuchen Eleanor Jia...

work page arXiv 2023

[6] [6]

prosodic breaks

with LoRA adapters (Hu et al., 2021). Fine- tuning and inference are performed on a single NVIDIA A100-SXM4-40GB GPU. Fine-tuning takes around 13 hours. Detailed hyperparameters are listed in Table 5. 11 Standard QE System:You are an evaluator. Given the source and/or audio and a translation, respond with only a single float score between 0 and 1 indicati...

2021