Benchmarking Speech-to-Speech Translation Models

Alkis Koudounas; Emiru Tsunoo; Hayato Futami; Osamu Take; Quentin Jodelet; Shinji Watanabe

arxiv: 2606.03241 · v1 · pith:5BC4R2YBnew · submitted 2026-06-02 · 💻 cs.CL · eess.AS


Pith reviewed 2026-06-28 10:31 UTC · model grok-4.3

classification 💻 cs.CL eess.AS

keywords speech-to-speech translation · evaluation metrics · benchmarking framework · naturalness · speaker preservation · translation quality · human correlation · metric reduction


The pith

S2ST architectures differ by over 30% in naturalness and speaker preservation but only a few points in translation quality, so single-metric rankings misrepresent model performance.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces COMPASS, a framework that applies 46 metrics across eight quality dimensions to 1,248 speech-to-speech translation configurations from two datasets. It finds that different architectures excel in different areas, with large gaps in naturalness and speaker preservation but small gaps in translation quality. Correlation analysis reduces the metrics to ten per translation direction while preserving overall rankings. Human listening tests across three domains confirm that certain domain-specific metrics align closely with listener preferences, whereas generic mean-opinion-score predictors do not.

Core claim

Applying the COMPASS suite to cascaded and end-to-end models shows complementary strengths: best-versus-worst gaps exceed 30 percent on naturalness and speaker preservation yet stay within a few points on translation quality. Correlation filtering yields ten metrics per direction whose rankings keep a Spearman rank correlation above 0.80 with the full-set rankings, while cutting evaluation time by a factor of roughly 2.5. In human validation, standalone MOS predictors fail to predict preference, but the top domain-specific metrics reach a correlation of at least 0.90 with listener judgments.
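The rank-preservation claim can be illustrated with a small check. The sketch below is not the paper's procedure: it assumes every metric is already scaled to [0, 1] with higher meaning better, and it assumes systems are ranked by the plain mean of their metric scores. The metric names `m1` to `m4` and all values are invented for illustration.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend `end` over the run of values equal to the one at `start`.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        for k in range(start, end + 1):
            ranks[order[k]] = (start + end) / 2 + 1
        start = end + 1
    return ranks


def spearman_rho(x, y):
    """Spearman's rho as the Pearson correlation of the ranks.

    Undefined (division by zero) if either input is constant.
    """
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    norm = (sum((a - mx) ** 2 for a in rx) * sum((b - my) ** 2 for b in ry)) ** 0.5
    return cov / norm


def aggregate(systems, metrics):
    """Mean score per system over the chosen metrics (assumed aggregation rule)."""
    return [mean(row[m] for m in metrics) for row in systems]


# Toy data, one dict per system; not results from the paper.
systems = [
    {"m1": 0.90, "m2": 0.88, "m3": 0.40, "m4": 0.42},
    {"m1": 0.70, "m2": 0.72, "m3": 0.80, "m4": 0.78},
    {"m1": 0.50, "m2": 0.49, "m3": 0.60, "m4": 0.61},
    {"m1": 0.30, "m2": 0.33, "m3": 0.20, "m4": 0.18},
]
full = aggregate(systems, ["m1", "m2", "m3", "m4"])
reduced = aggregate(systems, ["m1", "m3"])
rho = spearman_rho(full, reduced)
print(f"rho between full and reduced rankings: {rho:.2f}")
```

In this toy case the two-metric subset orders the four systems exactly as the full set does, so the check passes any threshold; with real data the value would be compared against a cutoff such as the 0.80 reported above.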

What carries the argument

COMPASS, the unified benchmarking framework that integrates 46 metrics across eight dimensions and applies correlation filtering to produce compact direction-specific subsets.

If this is right

Single-metric leaderboards will systematically misrepresent relative system quality across architectures.
Translation direction determines which metrics are most informative, requiring separate subsets for X to English and English to X.
Evaluation cost drops by a factor of about 2.5 while rank order is preserved when using the filtered metric sets.
Domain-specific metrics, not generic MOS predictors, should be used for human-aligned assessment in dubbing, podcast, and medical settings.
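The first point above, that a single-metric leaderboard can mislead, can be made concrete with a toy comparison. This sketch assumes the "gap" is measured relative to the best value, which is only one possible reading of the percentages quoted on this page. The system names, metric names, and scores are all hypothetical.

```python
def relative_gap(values):
    """Best-vs-worst gap as a percentage of the best value (higher is better)."""
    best, worst = max(values), min(values)
    return 100.0 * (best - worst) / best


def ranking(scores, metric):
    """System names ordered from best to worst on one metric."""
    return sorted(scores, key=lambda name: scores[name][metric], reverse=True)


# Invented scores for two hypothetical systems; not results from the paper.
scores = {
    "system_a": {"translation": 0.82, "naturalness": 4.4, "speaker_sim": 0.40},
    "system_b": {"translation": 0.80, "naturalness": 3.0, "speaker_sim": 0.65},
}

for metric in ("translation", "naturalness", "speaker_sim"):
    values = [row[metric] for row in scores.values()]
    print(f"{metric}: gap={relative_gap(values):.1f}%, order={ranking(scores, metric)}")
```

Here the translation scores differ by a few percent while the other two dimensions differ by more than 30 percent and disagree on which system is ahead, so a leaderboard built on any one column would hide the trade-off.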

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Model development could shift toward reporting the reduced metric panel rather than any single score.
The framework supplies a practical starting point for testing whether new domains require yet other metric combinations.
Hybrid cascaded and end-to-end pipelines might combine the complementary strengths observed in the benchmark.

Load-bearing premise

The 46 chosen metrics and the correlation-filtering step are assumed to retain the essential quality distinctions without bias introduced by the particular datasets or model configurations tested.

What would settle it

A new collection of S2ST models evaluated with both the full 46-metric set and the reduced 10-metric subsets produces rankings that disagree on which systems are best, or human preference scores in a held-out domain diverge from the reported correlations.

Figures

Figures reproduced from arXiv: 2606.03241 by Alkis Koudounas, Emiru Tsunoo, Hayato Futami, Osamu Take, Quentin Jodelet, Shinji Watanabe.

**Figure 2.** Models’ performance on key metrics. FLEURS X

**Figure 3.** Language proficiency, FLEURS (top) and CVSS (bottom), X

**Figure 4.** Two-dimensional t-SNE visualization of the full S2ST metric space for X→EN (top) and EN→X (bottom) translation directions. Metrics are colored according to their primary evaluation dimension. The clear spatial separation into disjoint, well-defined clusters highlights that the COMPASS filtering pipeline successfully identifies distinct orthogonal dimensions while minimizing intra-dimension structural red…

**Figure 5.** Pairwise Spearman rank correlation (ρ) matrices between metrics selected for the final compact subsets across X→EN (left) and EN→X (right) directions. The matrix demonstrates low cross-dimensional correlations. This confirms that the selected metrics deliver independent, complementary signals regarding system performance.

**Figure 6.** Per-language system performance profiles across the final filtered COMPASS dimensions on the FLEURS

**Figure 7.** Per-language results for EN→X translation on the FLEURS dataset. These plots show how different systems perform when generating output in diverse target languages, including the CJK (Chinese, Japanese, Korean) and Romance families. The shape of the plots confirms that our selected metrics, such as NISQA-MOS, ChrF++, and Energy Cont. Sim., work reliably across very different languages, keeping a clear distinctio…

**Figure 8.** Per-language evaluation profiles across COMPASS dimensions on the CVSS X

**Figure 9.** Language Proficiency Profile, FLEURS EN→X.

… profile fluctuates more aggressively when handling diverse, noisier acoustic inputs (FLEURS) than when it is evaluating cleaner corpora (CVSS). These directional and corpus-driven asymmetries further complement our findings in RQ2 by proving that X→EN and EN→X tracks present fundamentally distinct evaluation bottlenecks, where the specific source or target langu…

Original abstract

Speech-to-speech translation (S2ST) has advanced rapidly, but offline evaluation lacks a unified protocol: studies report non-overlapping metric subsets, preventing direct comparisons. We introduce COMPASS, a unified and reproducible benchmarking framework integrating 46 metrics across eight dimensions, and deploy it on 1,248 model-language configurations from FLEURS and CVSS, spanning cascaded and end-to-end architectures over ten language pairs. Architectures exhibit complementary strengths: best-vs-worst gaps exceed 30% on naturalness and speaker preservation but remain within a few points on translation quality, so single-metric rankings systematically misrepresent system quality. Correlation filtering reduces 46 metrics to 10 per direction, with three axes requiring different metrics across X→EN and EN→X (e.g., TER/UTMOS vs. ChrF++/NISQA-MOS); these subsets preserve rankings (Spearman's ρ > 0.80) while cutting evaluation time by ≈2.5×. Human validation across dubbing, podcasts, and medical domains shows standalone MOS predictors fail to predict listener preference, while top domain-specific metrics correlate with human judgment (ρ ≥ 0.90). We release COMPASS as a foundation for domain-aware S2ST evaluation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

COMPASS is a useful release for standardizing S2ST metrics with some solid empirical backing, though the reduced metric sets could use more validation to confirm they aren't overfit to the test data.


The paper's main contribution is releasing COMPASS, which bundles 46 metrics across eight dimensions for speech-to-speech translation and tests it on over a thousand model configs from FLEURS and CVSS. They find that you can reduce to about 10 metrics per translation direction while keeping Spearman's rho above 0.8 on rankings, and that different architectures do better on different aspects like naturalness versus translation quality.

This is useful because it highlights how single metrics can mislead and provides some domain-specific human correlation data showing that generic MOS doesn't cut it. The scale of the experiments and the release of the framework are clear positives for reproducibility in a field where results are often hard to compare.

One soft spot is the metric reduction step. The filtering is done on the same 1,248 configurations and two datasets that are used to check ranking preservation, so the chosen subsets might be tuned to those particular setups rather than being general. Without held-out validation on new data, it's hard to know if the 10-metric lists will hold up elsewhere. The abstract also skips details on how the initial 46 metrics were selected and how the correlation thresholds were set.

Overall, this is the kind of paper that helps a niche community standardize their evaluations. Readers working on S2ST models would get value from the framework and the findings on complementary strengths. It deserves a serious referee because the scale of the experiments and the release make it worth checking the methods in full.

Recommendation: Send it for peer review, with a request to add cross-validation for the reduced metric sets.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces COMPASS, a unified benchmarking framework integrating 46 metrics across eight dimensions for speech-to-speech translation (S2ST). It evaluates 1,248 model-language configurations from FLEURS and CVSS spanning cascaded and end-to-end architectures over ten language pairs. Key claims are that architectures show complementary strengths (best-vs-worst gaps >30% on naturalness/speaker preservation but small on translation quality, so single-metric rankings misrepresent quality), that correlation filtering reduces metrics to 10 per direction (with direction-specific choices like TER/UTMOS vs. ChrF++/NISQA-MOS) while preserving rankings (Spearman's ρ>0.80) and cutting evaluation time ~2.5×, and that human validation across domains shows standalone MOS predictors fail while top domain-specific metrics correlate with judgments (ρ≥0.90).

Significance. If the results hold, this establishes a reproducible, multi-metric protocol that directly addresses fragmentation in S2ST evaluation literature. The demonstration of architecture complementarity, the provision of reduced yet ranking-preserving metric subsets, and the domain-specific human validation could improve comparability, efficiency, and reliability of future S2ST assessments.

major comments (2)

[correlation filtering procedure (abstract and methods)] The correlation filtering procedure that reduces 46 metrics to 10 per direction is applied to the same 1,248 configurations and FLEURS/CVSS datasets used for all architecture comparisons and ranking preservation checks. This in-sample selection risks producing dataset-specific subsets without held-out validation on new models, languages, or domains, directly weakening the claim that the reduced subsets preserve essential quality information (ρ>0.80) and support reliable domain-aware evaluation.
[methods and abstract] No details are provided on the selection criteria for the initial 46 metrics across the eight dimensions, the exact correlation thresholds used for filtering, or the controls present in the 1,248 configurations. These omissions are load-bearing for the central claims on metric reduction and the superiority of the 10-metric subsets.

minor comments (1)

[abstract] The abstract introduces X→EN and EN→X without prior definition of the language-pair conventions.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below and commit to revisions that strengthen the manuscript without misrepresenting our results.


Referee: [correlation filtering procedure (abstract and methods)] The correlation filtering procedure that reduces 46 metrics to 10 per direction is applied to the same 1,248 configurations and FLEURS/CVSS datasets used for all architecture comparisons and ranking preservation checks. This in-sample selection risks producing dataset-specific subsets without held-out validation on new models, languages, or domains, directly weakening the claim that the reduced subsets preserve essential quality information (ρ>0.80) and support reliable domain-aware evaluation.

Authors: We agree this is a valid methodological concern. The reduction was performed in-sample on the full set of 1,248 configurations without a separate held-out set of models or domains. While the configurations are diverse (spanning cascaded and end-to-end models, ten language pairs, and two source datasets), this does not fully substitute for out-of-sample validation. We will revise the manuscript to explicitly acknowledge this limitation in the Methods and Discussion sections and to recommend that users of the reduced subsets perform held-out checks when applying them to new data.

revision: partial

Referee: [methods and abstract] No details are provided on the selection criteria for the initial 46 metrics across the eight dimensions, the exact correlation thresholds used for filtering, or the controls present in the 1,248 configurations. These omissions are load-bearing for the central claims on metric reduction and the superiority of the 10-metric subsets.

Authors: We agree that these details are necessary for reproducibility and for evaluating the claims. The current manuscript does not provide them. We will add a dedicated subsection (and appendix) that specifies: (i) the criteria used to compile the initial 46 metrics (standard coverage of the eight evaluation dimensions in the S2ST and related literature), (ii) the precise correlation thresholds and procedure (including the correlation measure and redundancy cutoff), and (iii) the configuration controls (model families, training regimes, and dataset handling). We will also include pseudocode for the filtering step.

revision: yes
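The filtering procedure itself is not specified anywhere on this page, so the following is only a guess at its general shape: a greedy redundancy filter that walks the metrics in a fixed priority order and drops any metric whose absolute Spearman correlation with an already-kept metric reaches a cutoff. The function name `filter_metrics`, the 0.9 cutoff, the priority order, and the toy columns are all assumptions for illustration, not values from the paper.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        for k in range(start, end + 1):
            ranks[order[k]] = (start + end) / 2 + 1
        start = end + 1
    return ranks


def spearman_rho(x, y):
    """Pearson correlation of the ranks; undefined if either input is constant."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    norm = (sum((a - mx) ** 2 for a in rx) * sum((b - my) ** 2 for b in ry)) ** 0.5
    return cov / norm


def filter_metrics(columns, priority, cutoff=0.9):
    """Keep a metric only if |rho| with every kept metric stays below the cutoff.

    columns:  metric name -> list of scores, one per configuration.
    priority: metric names in the order they are considered (assumed given).
    cutoff:   hypothetical redundancy threshold on |rho|.
    """
    kept = []
    for name in priority:
        redundant = any(
            abs(spearman_rho(columns[name], columns[other])) >= cutoff
            for other in kept
        )
        if not redundant:
            kept.append(name)
    return kept


# Invented scores over five configurations; not data from the paper.
columns = {
    "text_a": [10, 20, 30, 40, 50],
    "text_b": [12, 19, 33, 41, 55],  # orders the configurations exactly like text_a
    "natural": [3.0, 4.5, 2.0, 4.0, 3.5],
    "speaker": [0.9, 0.2, 0.5, 0.4, 0.7],
}
kept = filter_metrics(columns, ["text_a", "text_b", "natural", "speaker"])
print(kept)
```

A greedy pass like this depends on the priority order: whichever of two redundant metrics is considered first survives, which is one reason the exact procedure and thresholds matter for reproducibility.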

Circularity Check

0 steps flagged

No significant circularity: empirical benchmarking without derivations


The paper is a purely empirical benchmarking study that applies 46 external metrics to 1,248 model configurations and performs correlation-based filtering followed by rank-preservation checks. No equations, first-principles derivations, fitted parameters renamed as predictions, or self-referential definitions appear in the work. The filtering procedure is a transparent data-driven reduction whose output is evaluated directly on the same corpus; this is standard empirical practice and does not reduce any claimed result to its inputs by construction. No load-bearing self-citations or uniqueness theorems are invoked. The study is therefore self-contained against external benchmarks and receives the default non-circularity finding.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the representativeness of the chosen metrics and datasets; no new free parameters, axioms beyond standard evaluation assumptions, or invented entities are introduced in the abstract.

axioms (1)

Domain assumption: The eight dimensions and 46 metrics together provide a comprehensive and non-redundant view of S2ST quality.
Invoked by the decision to integrate exactly these metrics and to perform correlation filtering on them.




work page arXiv

[14] [14]

Available: https://arxiv.org/abs/2312.05187

Seamless: Multilingual Expressive and Streaming Speech Translation , author=. arXiv preprint arXiv:2312.05187 , year=

work page arXiv

[15] [15]

2025 , journal=

Qwen2.5-Omni Technical Report , author=. 2025 , journal=

2025

[16] [16]

Qwen3-Omni Technical Report

Qwen3-omni technical report , author=. arXiv preprint arXiv:2509.17765 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[17] [17]

International conference on machine learning , pages=

Robust speech recognition via large-scale weak supervision , author=. International conference on machine learning , pages=. 2023 , organization=

2023

[18] [18]

No Language Left Behind: Scaling Human-Centered Machine Translation

No language left behind: Scaling human-centered machine translation , author=. arXiv preprint arXiv:2207.04672 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[19] [19]

Gemma 3 Technical Report

Gemma 3 technical report , author=. arXiv preprint arXiv:2503.19786 , volume=. 2025 , publisher=

work page internal anchor Pith review Pith/arXiv arXiv 2025

[20] [20]

2025 , howpublished =

2025

[21] [21]

CosyVoice 3: Towards In-the-wild Speech Generation via Scaling-up and Post-training

Cosyvoice 3: Towards in-the-wild speech generation via scaling-up and post-training , author=. arXiv preprint arXiv:2505.17589 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[22] [22]

Proceedings of the 40th annual meeting of the Association for Computational Linguistics , pages=

Bleu: a method for automatic evaluation of machine translation , author=. Proceedings of the 40th annual meeting of the Association for Computational Linguistics , pages=

[23] [23]

A Study of Translation Edit Rate with Targeted Human Annotation

Snover, Matthew and Dorr, Bonnie and Schwartz, Rich and Micciulla, Linnea and Makhoul, John. A Study of Translation Edit Rate with Targeted Human Annotation. Proceedings of the 7th Conference of the Association for Machine Translation in the Americas: Technical Papers. 2006

2006

[24] [24]

Rei, Ricardo and Stewart, Craig and Farinha, Ana C and Lavie, Alon , booktitle=

[25] [25]

Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages=

Chen, Mingda and Duquenne, Paul-Ambroise and Andrews, Pierre and Kao, Justine and Mourachko, Alexandre and Schwenk, Holger and Costa-juss. Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages=

[26] [26]

Saeki, Takaaki and Xin, Detai and Nakata, Wataru and Koriyama, Tomoki and Takamichi, Shinnosuke and Saruwatari, Hiroshi , booktitle=

[27] [27]

Interspeech 2021 , year=

Mittag, Gabriel and Naderi, Babak and Chehadi, Assmaa and M. Interspeech 2021 , year=

2021

[28] [28]

IEEE Journal of Selected Topics in Signal Processing , volume=

Wavlm: Large-scale self-supervised pre-training for full stack speech processing , author=. IEEE Journal of Selected Topics in Signal Processing , volume=. 2022 , publisher=

2022

[29] [29]

Desplanques, Brecht and Thienpondt, Jenthe and Demuynck, Kris , booktitle=

[30] [30]

Findings of the Association for Computational Linguistics: ACL 2024 , pages=

emotion2vec: Self-supervised pre-training for speech emotion representation , author=. Findings of the Association for Computational Linguistics: ACL 2024 , pages=

2024

[31] [31]

Isometric

Lakew, Surafel M and Virkar, Yogesh and Mathur, Prashant and Federico, Marcello , booktitle=. Isometric. 2022 , organization=

2022

[32] [32]

arXiv preprint arXiv:2302.12979 , year=

Jointly optimizing translations and speech timing to improve isochrony in automatic dubbing , author=. arXiv preprint arXiv:2302.12979 , year=

work page arXiv

[33] [33]

ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) , pages=

Duration modeling of neural tts for automatic dubbing , author=. ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) , pages=. 2022 , organization=

2022

[34] [34]

2022 IEEE Spoken Language Technology Workshop (SLT) , pages=

Fleurs: Few-shot learning evaluation of universal representations of speech , author=. 2022 IEEE Spoken Language Technology Workshop (SLT) , pages=. 2023 , organization=

2022

[35] [35]

Proceedings of the 58th annual meeting of the association for computational linguistics , pages=

Unsupervised cross-lingual representation learning at scale , author=. Proceedings of the 58th annual meeting of the association for computational linguistics , pages=

[36] [36]

Advances in neural information processing systems , volume=

wav2vec 2.0: A framework for self-supervised learning of speech representations , author=. Advances in neural information processing systems , volume=

[37] [37]

IEEE Transactions on Information theory , volume=

Divergence measures based on the Shannon entropy , author=. IEEE Transactions on Information theory , volume=. 1991 , publisher=

1991

[38] [38]

Communication methods and measures , volume=

Agreement and information in the reliability of coding , author=. Communication methods and measures , volume=. 2011 , publisher=

2011

[39] [39]

Dingdong Wang, Jincenzi Wu, Junan Li, Dongchao Yang, Xueyuan Chen, Tianhua Zhang, and Helen Meng

Covost 2 and massively multilingual speech-to-text translation , author=. arXiv preprint arXiv:2007.10310 , year=

work page arXiv 2007

[40] [40]

Jia, Ye and Ramanovich, Michelle Tadmor and Wang, Quan and Zen, Heiga , booktitle=

[41] [41]

Europarl-

Iranzo-S. Europarl-. ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) , pages=. 2020 , organization=

2020

[42] [42]

2025 , organization=

Hu, Yuxuan and Wu, Haibin and Fan, Ruchao and Wang, Xiaofei and Lu, Heng and Qian, Yao and Li, Jinyu , booktitle=. 2025 , organization=

2025

[43] [43]

Chen, Sirou and Yahata, Sakiko and Shimizu, Shuichiro and Yang, Zhengdong and Li, Yihang and Chu, Chenhui and Kurohashi, Sadao , booktitle=

[44] [44]

Le-Duc, Khai and Tran, Tuyen and Tat, Bach Phan and Bui, Nguyen Kim Hai and Anh, Quan Dang and Tran, Hung-Phong and Nguyen, Thanh Thuy and Nguyen, Ly and Phan, Tuan Minh and Tran, Thi Thu Phuong and others , booktitle=

[45] [45]

Preprint, arXiv:2512.17648

Simulstream: Open-Source Toolkit for Evaluation and Demonstration of Streaming Speech-to-Text Translation Systems , author=. arXiv preprint arXiv:2512.17648 , year=

work page arXiv

[46] [46]

2021 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU) , pages=

Assessing evaluation metrics for speech-to-speech translation , author=. 2021 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU) , pages=. 2021 , organization=

2021

[47] [47]

Transactions of the Association for Computational Linguistics , volume=

Experts, errors, and context: A large-scale study of human evaluation for machine translation , author=. Transactions of the Association for Computational Linguistics , volume=

[48] [48]

Tangled up in

Mathur, Nitika and Baldwin, Timothy and Cohn, Trevor , booktitle=. Tangled up in

[49] [49]

Proceedings of the 20th International Conference on Spoken Language Translation (IWSLT 2023) , pages=

Mach. Proceedings of the 20th International Conference on Spoken Language Translation (IWSLT 2023) , pages=

2023

[50] [50]

Cheng, Sitong and Bian, Weizhen and Wang, Xinsheng and Yuan, Ruibin and Chen, Jianyi and Yin, Shunshun and Guo, Yike and Xue, Wei , journal=

[51] [51]

arXiv preprint arXiv:2511.20974 , year=

RosettaSpeech: Zero-Shot Speech-to-Speech Translation from Monolingual Data , author=. arXiv preprint arXiv:2511.20974 , year=

work page arXiv

[52] [52]

Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages=

Direct speech-to-speech translation with discrete units , author=. Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages=

[53] [53]

2023 , organization=

Karunya, S and Jalakandeshwaran, M and Uma, R and others , booktitle=. 2023 , organization=

2023

[54] [54]

Cascade versus direct speech translation: Do the differences still make a difference? , author=. Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers) , pages=

[55] [55]

2026 , howpublished =

Gemma4 , author =. 2026 , howpublished =

2026

[56] [56]

VoxPopuli: A large-scale multilingual speech corpus for representation learning, semi-supervised learning and interpretation , author=. Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers) , pages=

[57] [57]

ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) , pages=

Generalization ability of MOS prediction networks , author=. ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) , pages=. 2022 , organization=

2022

[58] [58]

V oxtral,

Voxtral , author=. arXiv preprint arXiv:2507.13264 , year=

work page arXiv

[59] [59]

Qwen3 Technical Report

Qwen3 technical report , author=. arXiv preprint arXiv:2505.09388 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[60] [60]

Proceedings of the tenth workshop on statistical machine translation , pages=

Popovi. Proceedings of the tenth workshop on statistical machine translation , pages=

[61] [61]

Proceedings of the second conference on machine translation , pages=

Popovi. Proceedings of the second conference on machine translation , pages=

[62] [62]

(2022) ’COMET-22: Unbabel-IST 2022 Submission for the Metrics Shared Task’ in Proceedings of the Seventh Conference on Machine Translation

Rei, Ricardo and C. de Souza, Jos \'e G. and Alves, Duarte and Zerva, Chrysoula and Farinha, Ana C and Glushkova, Taisiya and Lavie, Alon and Coheur, Luisa and Martins, Andr \'e F. T. COMET -22: Unbabel- IST 2022 Submission for the Metrics Shared Task. Proceedings of the Seventh Conference on Machine Translation (WMT). 2022. doi:10.18653/v1/2022.wmt-1.52

work page doi:10.18653/v1/2022.wmt-1.52 2022

[63] [63]

doi:10.21437/Interspeech.2025-891 , issn =

Bornali Phukon and Xiuwen Zheng and Mark Hasegawa-Johnson , year =. doi:10.21437/Interspeech.2025-891 , issn =

work page doi:10.21437/interspeech.2025-891 2025

[64] [64]

International Conference on Learning Representations , year=

BERTScore: Evaluating Text Generation with BERT , author=. International Conference on Learning Representations , year=

[65] [65]

A Call for Clarity in Reporting BLEU Scores

Post, Matt. A Call for Clarity in Reporting BLEU Scores. Proceedings of the Third Conference on Machine Translation: Research Papers. 2018

2018

[66] [66]

Glot International , year =

Boersma, Paul , title =. Glot International , year =

[67] [67]

Dubbing in Practice: A Large Scale Study of Human Localization With Insights for Automatic Dubbing

Brannon, William and Virkar, Yogesh and Thompson, Brian. Dubbing in Practice: A Large Scale Study of Human Localization With Insights for Automatic Dubbing. Transactions of the Association for Computational Linguistics. 2023. doi:10.1162/tacl_a_00551

work page doi:10.1162/tacl_a_00551 2023

[68] [68]

Chenyang Le and Yao Qian and Dongmei Wang and Long Zhou and Shujie LIU and Xiaofei Wang and Midia Yousefi and Yanmin Qian and Jinyu Li and Michael Zeng , booktitle=. Trans. 2024 , url=

2024

[69] [69]

Findings of the

Anastasopoulos, Antonios and Barrault, Lo. Findings of the. Proceedings of the 19th international conference on spoken language translation (IWSLT 2022) , pages=

2022

[70] [70]

Journal of Machine Learning Research , volume=

Scaling speech technology to 1,000+ languages , author=. Journal of Machine Learning Research , volume=

[71] [71]

Rozanov, Nikolai and Pankov, Vikentiy and Mukhutdinov, Dmitrii and Vypirailenko, Dima , booktitle=

[72] [72]

Findings of the IWSLT 2025 Evaluation Campaign

Abdulmumin, Idris and Agostinelli, Victor and Alum. Findings of the IWSLT 2025 Evaluation Campaign. Proceedings of the 22nd International Conference on Spoken Language Translation (IWSLT 2025). 2025. doi:10.18653/v1/2025.iwslt-1.44

work page doi:10.18653/v1/2025.iwslt-1.44 2025

[73] [73]

FINDINGS OF THE IWSLT 2024 EVALUATION CAMPAIGN

Ahmad, Ibrahim Said and Anastasopoulos, Antonios and Bojar, Ond. FINDINGS OF THE IWSLT 2024 EVALUATION CAMPAIGN. Proceedings of the 21st International Conference on Spoken Language Translation (IWSLT 2024). 2024. doi:10.18653/v1/2024.iwslt-1.1

work page doi:10.18653/v1/2024.iwslt-1.1 2024

[74] [74]

Ma, Mingbo and Huang, Liang and Xiong, Hao and Zheng, Renjie and Liu, Kaibo and Zheng, Baigong and Zhang, Chuanqiang and He, Zhongjun and Liu, Hairong and Li, Xing and others , booktitle=

[75] [75]

Ma, Xutai and Dousti, Mohammad Javad and Wang, Changhan and Gu, Jiatao and Pino, Juan , booktitle=

[76] [76]

Proceedings of the Third Workshop on Automatic Simultaneous Translation , pages=

Over-generation cannot be rewarded: Length-adaptive average lagging for simultaneous speech translation , author=. Proceedings of the Third Workshop on Automatic Simultaneous Translation , pages=