Why We Need Speech to Evaluate Speech Translation
Pith reviewed 2026-06-29 13:30 UTC · model grok-4.3
The pith
Current metrics for speech translation fail to detect whether gender agreement and prosody are preserved, even when given the speech signal directly.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Both text- and speech-based quality estimation metrics fall short at assessing preservation of gender agreement and prosody in speech translation, even when given direct access to the speech signal. SpeechCOMET models with speech encoders and a state-of-the-art SpeechLLM judge match or exceed text-based COMET on standard quality estimation yet do not consistently evaluate speech-specific phenomena. The shortcomings arise because speech-specific features are not reliably preserved in current encoders, models tend to ignore the speech source signal, and quality estimation training data contains too few relevant examples.
What carries the argument
Contrastive datasets targeting gender agreement and prosody mismatches, used to meta-evaluate quality estimation metrics including new SpeechCOMET variants with speech encoders.
If this is right
- Standard quality estimation benchmarks are insufficient to measure progress on speech-specific preservation in translation.
- Speech encoders must be improved so that gender and prosody features are reliably encoded.
- Training data for quality estimation must include more examples that target speech-specific phenomena.
- Quality estimation models need architectures that genuinely condition on the speech source rather than defaulting to text.
Where Pith is reading between the lines
- Similar evaluation gaps are likely to appear in other speech-to-text or speech-to-speech tasks that involve paralinguistic features.
- Improving speech conditioning in quality estimation could indirectly raise the bar for speech translation model training itself.
- Future work could test whether larger or differently pre-trained speech encoders reduce the observed gaps without new task-specific data.
Load-bearing premise
The contrastive datasets isolate gender agreement and prosody without introducing other differences that could affect metric scores.
What would settle it
A quality estimation model that assigns higher scores to the gender-matching translation than to the mismatched one on most pairs in the gender contrastive set, and likewise for the prosody contrastive set.
Figures
read the original abstract
Speech translation models are increasingly capable of preserving speech-specific information (e.g., speaker gender, prosody, and emphasis), yet evaluation metrics remain blind to such phenomena. We meta-evaluate both text- and speech-based quality estimation metrics on two contrastive datasets targeting gender agreement and prosody, and find that both fall short, even when given direct access to the speech signal. We then train SpeechCOMET, a family of quality estimation models with speech encoders, and evaluate a state-of-the-art SpeechLLM as a judge. Both match or exceed text-based COMET on standard quality estimation, but neither consistently assesses speech-specific phenomena. We identify three causes: (1) speech-specific features are not reliably preserved in current encoders, (2) models tend to ignore the speech source signal, and (3) quality estimation training data contains too few relevant examples. We release all models and code, and argue that progress requires dedicated speech-specific training data and models that genuinely condition on speech.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that existing text- and speech-based quality estimation (QE) metrics for speech translation fail to assess speech-specific phenomena such as gender agreement and prosody, even when given direct access to the speech signal. It meta-evaluates metrics on two new contrastive datasets targeting these phenomena, trains a family of speech-encoder QE models called SpeechCOMET that match or exceed text COMET on standard QE benchmarks, and tests a state-of-the-art SpeechLLM as a judge; both still fall short on the speech-specific tasks. The authors identify three causes—unreliable preservation of speech features in encoders, tendency to ignore the speech source, and scarcity of relevant examples in QE training data—and release all models and code, arguing for dedicated speech-specific training resources.
Significance. If the meta-evaluation results hold, the work identifies a substantive gap in current evaluation practices for speech translation systems that aim to preserve paralinguistic information. The explicit release of models, code, and (presumably) the contrastive datasets supports reproducibility and follow-on work. The diagnosis of three concrete failure modes provides actionable directions, though the strength of the conclusions depends on the validity of the dataset construction.
major comments (2)
- [Dataset construction] Dataset construction section: the central claim that metrics 'fall short' on speech-specific phenomena rests on the assumption that the two contrastive datasets isolate gender agreement and prosody without confounds in fluency, semantics, or acoustic naturalness. No description of pair-construction method (editing, synthesis, or otherwise), human validation of isolation, or ablation confirming other quality dimensions are held constant is provided; without this, score differences cannot be attributed specifically to failure on the targeted features.
- [Analysis of failure modes] Section 4 (or equivalent) on causes: the three identified causes—(1) speech features not reliably preserved in encoders, (2) models ignoring the speech source signal, and (3) insufficient relevant examples in training data—are presented as explanations, but the supporting experiments (e.g., ablation on source conditioning or data statistics) are not detailed enough to establish that these are the load-bearing reasons rather than artifacts of the particular contrastive pairs.
minor comments (2)
- [Abstract] Abstract and introduction: the claim that 'both fall short, even when given direct access to the speech signal' would be clearer if the exact speech-based metrics and input formats were named at first mention.
- [Conclusion] The paper states it releases 'all models and code'; confirm that the contrastive datasets themselves are also released with documentation of their construction pipeline.
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive report. The two major comments highlight important areas where additional clarity and evidence are needed to support our claims about the contrastive datasets and the identified failure modes. We address each point below and will revise the manuscript to incorporate the requested details and expanded analyses.
read point-by-point responses
-
Referee: [Dataset construction] Dataset construction section: the central claim that metrics 'fall short' on speech-specific phenomena rests on the assumption that the two contrastive datasets isolate gender agreement and prosody without confounds in fluency, semantics, or acoustic naturalness. No description of pair-construction method (editing, synthesis, or otherwise), human validation of isolation, or ablation confirming other quality dimensions are held constant is provided; without this, score differences cannot be attributed specifically to failure on the targeted features.
Authors: We agree that the manuscript currently lacks sufficient detail on dataset construction to fully substantiate the isolation of the targeted phenomena. The provided overview does not include explicit descriptions of pair-construction methods, human validation protocols, or ablations verifying that fluency, semantics, and acoustic naturalness remain constant. In the revised manuscript, we will expand the relevant section to describe the construction process (including any use of editing or synthesis), report the results of human validation studies confirming feature isolation, and include ablations or controls demonstrating that other quality dimensions are held constant. This will allow the score differences to be more confidently attributed to the speech-specific features. revision: yes
-
Referee: [Analysis of failure modes] Section 4 (or equivalent) on causes: the three identified causes—(1) speech features not reliably preserved in encoders, (2) models ignoring the speech source signal, and (3) insufficient relevant examples in training data—are presented as explanations, but the supporting experiments (e.g., ablation on source conditioning or data statistics) are not detailed enough to establish that these are the load-bearing reasons rather than artifacts of the particular contrastive pairs.
Authors: We acknowledge that the current presentation of the three failure modes relies on supporting experiments that may not be detailed enough to establish them as the primary causes independent of the specific contrastive pairs. The manuscript includes initial evidence from source-conditioning ablations and training-data statistics, but these require expansion to rule out potential artifacts. In the revision, we will add more detailed experiments in the relevant section, including additional quantitative ablations on source signal usage, probing results for feature preservation in encoders, and expanded statistics on the scarcity of relevant examples in QE training data. We will qualify the conclusions if the expanded analyses do not fully support the three causes as load-bearing. revision: yes
Circularity Check
No significant circularity; empirical meta-evaluation is self-contained
full rationale
The paper conducts meta-evaluation of quality estimation metrics on contrastive datasets for gender agreement and prosody, trains SpeechCOMET models, and evaluates a SpeechLLM judge. No equations, derivations, fitted parameters renamed as predictions, or self-citation chains appear in the provided text. Claims rest on experimental results and dataset construction rather than reducing by construction to inputs or prior self-authored uniqueness results. The central findings are falsifiable via the released models and code.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption The contrastive datasets targeting gender agreement and prosody isolate speech-specific phenomena without confounding variables that would affect metric scores.
invented entities (1)
-
SpeechCOMET
no independent evidence
Reference graph
Works this paper leans on
-
[1]
How to Evaluate Speech Translation with Source-Aware Neural MT Metrics
Gender in danger? evaluating speech transla- tion technology on the MuST-SHE corpus. InPro- ceedings of the 58th Annual Meeting of the Asso- ciation for Computational Linguistics, pages 6923– 6933, Online. Association for Computational Lin- guistics. 9 Mauro Cettolo, Marco Gaido, Matteo Negri, Sara Papi, and Luisa Bentivogli. 2025. How to evaluate speech ...
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[2]
Voice, Bias, and Coreference: An Interpretability Study of Gender in Speech Translation
V oice, bias, and coreference: An interpretabil- ity study of gender in speech translation.Preprint, arXiv:2511.21517. David Dale and Marta R. Costa-jussà. 2024. BLASER 2.0: a metric for evaluation and quality estimation of massively multilingual speech and text translation. InFindings of the Association for Computational Linguistics: EMNLP 2024, pages 16...
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[3]
Hearing to Translate: The Effectiveness of Speech Modality Integration into LLMs
Hearing to translate: The effectiveness of speech modality integration into llms.Preprint, arXiv:2512.16378. Alec Radford, Jong Wook Kim, Tao Xu, Greg Brock- man, Christine Mcleavey, and Ilya Sutskever. 2023. Robust speech recognition via large-scale weak su- pervision. InProceedings of the 40th International Conference on Machine Learning, volume 202 of ...
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[4]
COMET-22: Unbabel-IST 2022 submission for the metrics shared task. InProceedings of the Seventh Conference on Machine Translation (WMT), pages 578–585, Abu Dhabi, United Arab Emirates (Hybrid). Association for Computational Linguistics. Haoqin Sun, Xuechen Wang, Jinghua Zhao, Shiwan Zhao, Jiaming Zhou, Hui Wang, Jiabei He, Aobo Kong, Xi Yang, Yequan Wang,...
-
[5]
Llamafactory: Unified efficient fine-tuning of 100+ language models. InProceedings of the 62nd Annual Meeting of the Association for Compu- tational Linguistics (Volume 3: System Demonstra- tions), Bangkok, Thailand. Association for Computa- tional Linguistics. Vilém Zouhar, Shehzaad Dhuliawala, Wangchunshu Zhou, Nico Daheim, Tom Kocmi, Yuchen Eleanor Jia...
-
[6]
prosodic breaks
with LoRA adapters (Hu et al., 2021). Fine- tuning and inference are performed on a single NVIDIA A100-SXM4-40GB GPU. Fine-tuning takes around 13 hours. Detailed hyperparameters are listed in Table 5. 11 Standard QE System:You are an evaluator. Given the source and/or audio and a translation, respond with only a single float score between 0 and 1 indicati...
2021
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.