How to Evaluate Speech Translation with Source-Aware Neural MT Metrics

2025 · cs.CL · arXiv 2511.03295

2 Pith papers cite this work. Polarity classification is still indexing.

abstract

Automatic evaluation of ST systems is typically performed by comparing translation hypotheses with one or more reference translations. While effective to some extent, this approach inherits the limitation of reference-based evaluation that ignores valuable information from the source input. In MT, recent progress has shown that neural metrics incorporating the source text achieve stronger correlation with human judgments. Extending this idea to ST, however, is not trivial because the source is audio rather than text, and reliable transcripts or alignments between source and references are often unavailable. In this work, we conduct the first systematic study of source-aware metrics for ST, with a particular focus on real-world operating conditions where source transcripts are not available. We explore two complementary strategies for generating textual proxies of the input audio: ASR transcripts and back-translations of the reference translation. We also introduce a novel two-step cross-lingual re-segmentation algorithm to address the alignment mismatch between synthetic sources and reference translations. Our experiments, carried out on two ST benchmarks covering 79 language pairs and six ST systems with diverse architectures and performance levels, show that ASR transcripts constitute a more reliable synthetic source than back-translations when word error rate is below 20%, while back-translations always represent a computationally cheaper but still effective alternative. The robustness of these findings is further confirmed by experiments on a low-resource language pair (Bemba-English) and by a direct validation against human quality judgments. Furthermore, our cross-lingual re-segmentation algorithm enables robust use of source-aware MT metrics in ST evaluation, paving the way toward more accurate and principled evaluation methodologies for speech translation.
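
The abstract's finding about the 20% word error rate threshold can be illustrated with a small sketch. This is not the authors' code: the function names `word_error_rate` and `choose_synthetic_source` are hypothetical, WER is computed here as a plain word-level edit distance over whitespace tokens, and it is assumed that the ASR system's WER can be estimated on some held-out data that does have transcripts.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] = edit distance between the processed reference prefix and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


def choose_synthetic_source(estimated_wer: float, threshold: float = 0.20) -> str:
    """Pick a textual proxy for the audio source (hypothetical helper).

    Follows the abstract's reported finding: prefer ASR transcripts when WER
    is below 20%, otherwise fall back to back-translations of the reference.
    """
    return "asr_transcript" if estimated_wer < threshold else "back_translation"


# Usage: estimate WER on held-out transcribed audio, then pick the proxy.
wer = word_error_rate("the cat sat on the mat", "the cat sat on a mat")
print(choose_synthetic_source(wer))
```

The threshold is exposed as a parameter because the 20% figure is an empirical result from the paper's benchmarks and may not transfer unchanged to other data.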

representative citing papers

Why We Need Speech to Evaluate Speech Translation

cs.CL · 2026-05-27 · unverdicted · novelty 6.0

Meta-evaluation on gender and prosody contrastive datasets finds text and speech quality estimation metrics fall short at assessing speech-specific features, including newly trained SpeechCOMET models.

HydraQE: OSU's Submission for the IWSLT 2026 Speech Translation Metrics Shared Task

cs.CL · 2026-06-07 · unverdicted · novelty 4.0

HydraQE is a new end-to-end speech translation QE system using Qwen3-ASR backbone, sparsemax layer mixing, bidirectional Transformer, and multi-task curriculum training on human and pseudo labels that outperforms cascaded baselines.

citing papers explorer

Why We Need Speech to Evaluate Speech Translation cs.CL · 2026-05-27 · unverdicted · none · ref 1 · internal anchor
HydraQE: OSU's Submission for the IWSLT 2026 Speech Translation Metrics Shared Task cs.CL · 2026-06-07 · unverdicted · none · ref 3 · internal anchor
