pith. machine review for the scientific record.

arxiv: 2512.16378 · v4 · submitted 2025-12-18 · 💻 cs.CL · cs.AI · cs.SD

Recognition: 2 theorem links · Lean Theorem

Hearing to Translate: The Effectiveness of Speech Modality Integration into LLMs

Authors on Pith: no claims yet

Pith reviewed 2026-05-16 21:45 UTC · model grok-4.3

classification 💻 cs.CL · cs.AI · cs.SD
keywords speech translation · SpeechLLMs · cascaded systems · speech foundation models · LLM integration · benchmark evaluation · multilingual ST

The pith

Cascaded systems remain the most reliable for speech translation overall, though recent SpeechLLMs can match or outperform them in various settings.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether SpeechLLMs that take speech as direct input improve translation quality over traditional cascaded pipelines that first transcribe speech then translate the text. It runs a large-scale comparison of six SpeechLLMs against sixteen strong cascade combinations of speech foundation models and multilingual LLMs. The tests cover sixteen benchmarks, thirteen language pairs, and nine difficult conditions such as noisy, disfluent, and long-form speech. Cascades prove most dependable overall, yet newer SpeechLLMs reach or exceed cascade performance in several cases while pure speech foundation models trail both. The results show that some form of LLM integration, whether inside one model or across a pipeline, is required for high-quality speech translation.
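The two architectures under comparison can be sketched schematically. The stub functions below are illustrative placeholders for real models (an SFM such as Whisper for ASR, a multilingual LLM for MT), not the paper's implementations; only the shape of the two pipelines is the point.

```python
# Schematic contrast between cascaded and direct speech translation.
# All functions are illustrative stand-ins, not real model calls.

def sfm_transcribe(audio: bytes, src_lang: str) -> str:
    """Stand-in for a speech foundation model's ASR pass."""
    return audio.decode("utf-8")  # pretend the audio decodes to its transcript

def llm_translate(text: str, src_lang: str, tgt_lang: str) -> str:
    """Stand-in for a text-only multilingual LLM's MT pass."""
    return f"[{src_lang}->{tgt_lang}] {text}"

def cascaded_st(audio: bytes, src_lang: str, tgt_lang: str) -> str:
    """Cascade: transcribe first, then translate the transcript.
    ASR errors propagate into the MT step."""
    transcript = sfm_transcribe(audio, src_lang)
    return llm_translate(transcript, src_lang, tgt_lang)

def direct_st(audio: bytes, src_lang: str, tgt_lang: str) -> str:
    """SpeechLLM: one model maps audio straight to target-language text,
    with no intermediate transcript to inspect or to corrupt."""
    return f"[{src_lang}->{tgt_lang}] {audio.decode('utf-8')}"
```

The single-step interface is what SpeechLLMs trade for the cascade's inspectable intermediate transcript.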

Core claim

Across this extensive evaluation, we find that cascaded systems remain the most reliable solution overall, but most recent SpeechLLMs can match or even outperform cascades in various settings while SFMs lag behind both, highlighting that integrating an LLM, either within the model or in a pipeline, is essential for high-quality speech translation.

What carries the argument

Large-scale benchmarking of six SpeechLLMs against sixteen cascaded SFM-plus-LLM systems across sixteen benchmarks, thirteen language pairs, and nine conditions including noise and disfluency.

If this is right

  • Cascaded systems should be the default choice when reliability across varied inputs is the priority.
  • Recent SpeechLLMs become practical alternatives in specific language pairs or speech conditions where they perform on par or better.
  • Pure speech foundation models without LLM components are insufficient for competitive translation quality.
  • Any effective speech translation approach requires LLM capabilities, either embedded directly or added in a pipeline.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Hybrid designs that feed speech features into cascaded LLM pipelines could combine the strengths of both approaches.
  • The performance patterns suggest modular cascaded systems may retain an edge for consistency on highly variable real-world audio.
  • Similar direct-versus-cascaded comparisons could be run for other speech tasks such as summarization or question answering.
  • Practitioners may need condition-specific testing rather than a single architecture choice for production speech translation.

Load-bearing premise

The sixteen benchmarks, thirteen language pairs, nine conditions, and chosen systems adequately represent real-world speech translation challenges and current state-of-the-art methods.

What would settle it

A follow-up study on an expanded set of benchmarks or conditions where cascaded systems consistently outperform all tested SpeechLLMs would undermine the claim that recent SpeechLLMs can match or exceed cascades in various settings.

Figures

Figures reproduced from arXiv: 2512.16378 by Ahrii Kim, Carlos Escolano, Dominik Macháček, Gerard I. Gállego, Javier Garcia Gilabert, Jorge Iranzo-Sánchez, Maike Züfle, Patricia Schmidtova, Sara Papi, Vilém Zouhar, Zachary Hopton.

Figure 1
Figure 1: Plot showing the relationship between Gender Coreference Gap (∆F1♀♂) and Stereotypical Gap (∆S♀♂) across all evaluated systems and language pairs. view at source ↗
Figure 2
Figure 2: Standard deviation of xCOMET-QE scores for ManDi (zh-en) and CommonAccent (all other directions) across source-language accent. view at source ↗
Figure 3
Figure 3: Screenshot of the Pearmut (Zouhar, 2026) annotation interface together with annotation guidelines. The annotator first listens to the source audio, then scans the three model outputs, marking error spans with severities and categories. Lastly, the annotator assigns the final scores and proceeds to the next item. view at source ↗
Figure 4
Figure 4: xCOMET-QE results for language pairs into English, broken down by source-language accent. zh-en results come from ManDi, while all other pairs represent CommonAccent results. view at source ↗
Figure 5
Figure 5: CommonAccent xCOMET-QE results for language pairs out of English, broken down by source speech accent. view at source ↗
read the original abstract

As Large Language Models (LLMs) expand beyond text, integrating speech as a native modality has given rise to SpeechLLMs, which directly process spoken language and enable speech-to-text translation (ST) and other downstream tasks, bypassing traditional transcription-based pipelines. Whether this integration improves ST quality over established cascaded architectures, however, remains an open question. We present Hearing to Translate, the first comprehensive test suite rigorously benchmarking 6 state-of-the-art SpeechLLMs against 16 strong direct and cascade systems that couple leading speech foundation models (SFM), with multilingual LLMs. Our analysis spans 16 benchmarks, 13 language pairs, and 9 challenging conditions, including disfluent, noisy, and long-form speech. Across this extensive evaluation, we find that cascaded systems remain the most reliable solution overall, but most recent SpeechLLMs can match or even outperform cascades in various settings while SFMs lag behind both, highlighting that integrating an LLM, either within the model or in a pipeline, is essential for high-quality speech translation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it: the pith above is the substance; this is the friction.

Referee Report

3 major / 3 minor

Summary. The paper conducts a large-scale empirical benchmarking of 6 state-of-the-art SpeechLLMs against 16 cascaded and direct systems (pairing speech foundation models with multilingual LLMs) for speech-to-text translation. Evaluation spans 16 benchmarks, 13 language pairs, and 9 conditions including disfluent, noisy, and long-form speech. The central finding is that cascaded systems remain most reliable overall, recent SpeechLLMs can match or exceed them in select settings, SFMs lag behind, and LLM integration (either native or in pipeline) is essential for high-quality performance.

Significance. If the comparative rankings hold under scrutiny, the work supplies a valuable, broad empirical reference point for the speech-translation community on when native speech-modality integration in LLMs is advantageous versus when established cascades remain preferable. The explicit coverage of challenging acoustic and linguistic conditions strengthens the practical relevance of the conclusions.

major comments (3)
  1. [Abstract, §4] Abstract and §4 (Evaluation setup): the headline claim that 'cascaded systems remain the most reliable solution overall' is not accompanied by an explicit aggregate metric (e.g., mean rank or macro-averaged BLEU across all 16 benchmarks and 9 conditions) or by any statistical test establishing significance of the observed differences; without these, the 'most reliable' ranking rests on qualitative summary rather than a verifiable quantitative criterion.
  2. [§3.2, §4.3] §3.2 (Model selection) and §4.3 (Results): the assertion that the chosen 6 SpeechLLMs and 16 cascade combinations are representative of 'state-of-the-art' is load-bearing for the generalizability of the ranking, yet the paper provides no explicit inclusion criteria, recency cutoff, or ablation showing that omitted recent systems would not alter the relative ordering.
  3. [§4.1] §4.1 (Benchmarks and conditions): while 16 benchmarks and 9 conditions are enumerated, the manuscript does not report whether the same train/dev/test splits were used uniformly across all systems or whether any benchmark overlap exists that could inflate apparent performance gaps between cascaded and end-to-end models.
minor comments (3)
  1. [Throughout] Ensure every acronym (SFM, ST, SpeechLLM, etc.) is defined at first use in the main text, not only in the abstract.
  2. [§4.3] Tables reporting per-condition scores should include standard deviations or confidence intervals so readers can assess stability of the reported differences.
  3. [§6] The paper would benefit from a short paragraph in the conclusion explicitly listing the precise conditions under which SpeechLLMs outperform cascades, rather than leaving this to the reader to extract from the tables.
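The aggregate criterion the report asks for could be made concrete as a mean rank over benchmark-by-condition cells. This is one possible choice of aggregate, not the paper's; the scores below are invented for illustration.

```python
# Mean-rank aggregation across benchmark/condition cells as one concrete
# "reliability" criterion. Scores are invented; higher score = better.

def mean_ranks(scores_by_cell):
    """scores_by_cell: list of {system: score} dicts, one per cell.
    Returns {system: average rank}, where rank 1 is best in a cell."""
    totals = {}
    for cell in scores_by_cell:
        # Sort systems by score, best first (ties keep iteration order).
        ordered = sorted(cell, key=cell.get, reverse=True)
        for rank, system in enumerate(ordered, start=1):
            totals.setdefault(system, []).append(rank)
    return {s: sum(r) / len(r) for s, r in totals.items()}

cells = [
    {"cascade": 31.2, "speechllm": 30.8, "sfm": 27.4},
    {"cascade": 28.9, "speechllm": 29.5, "sfm": 25.1},
    {"cascade": 33.0, "speechllm": 31.1, "sfm": 26.7},
]
print(mean_ranks(cells))
```

A system winning most cells gets a mean rank near 1, which is the kind of single verifiable number the referee's first major comment asks the paper to report.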

Simulated Authors' Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and will incorporate revisions to enhance the quantitative rigor and transparency of the manuscript.

read point-by-point responses
  1. Referee: [Abstract, §4] Abstract and §4 (Evaluation setup): the headline claim that 'cascaded systems remain the most reliable solution overall' is not accompanied by an explicit aggregate metric (e.g., mean rank or macro-averaged BLEU across all 16 benchmarks and 9 conditions) or by any statistical test establishing significance of the observed differences; without these, the 'most reliable' ranking rests on qualitative summary rather than a verifiable quantitative criterion.

    Authors: We agree that an explicit aggregate metric would make the claim more verifiable. In the revision we will add a table of average ranks across all benchmarks and conditions, along with a brief discussion of why formal significance testing is difficult given benchmark heterogeneity. The conclusion will be qualified accordingly while retaining the core observation that cascades lead in the majority of settings. revision: yes

  2. Referee: [§3.2, §4.3] §3.2 (Model selection) and §4.3 (Results): the assertion that the chosen 6 SpeechLLMs and 16 cascade combinations are representative of 'state-of-the-art' is load-bearing for the generalizability of the ranking, yet the paper provides no explicit inclusion criteria, recency cutoff, or ablation showing that omitted recent systems would not alter the relative ordering.

    Authors: Selection was based on models published or released in 2024 with publicly reported speech-translation results at the time of experimentation. We will add explicit criteria in §3.2 (post-2023 publication, open weights or API availability, and coverage of both native SpeechLLM and cascade paradigms) and a short limitations paragraph noting that exhaustive ablation of every concurrent system is infeasible but that the chosen set spans the dominant architectural families. revision: yes

  3. Referee: [§4.1] §4.1 (Benchmarks and conditions): while 16 benchmarks and 9 conditions are enumerated, the manuscript does not report whether the same train/dev/test splits were used uniformly across all systems or whether any benchmark overlap exists that could inflate apparent performance gaps between cascaded and end-to-end models.

    Authors: All models were evaluated zero-shot on the official test partitions of each benchmark; no training or dev-set adaptation was performed for any system. We will insert a clarifying paragraph in §4.1 that lists the exact splits used and confirms the absence of any training-data overlap or leakage that could favor one paradigm. revision: yes
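The rebuttal's point about significance testing can be made concrete with paired bootstrap resampling over segment-level metric scores, a standard practice in MT evaluation. The function and scores below are an illustrative sketch, not the authors' protocol.

```python
# Paired bootstrap resampling over segment-level scores: resample the same
# segment indices for both systems and count how often A's mean beats B's.
# Scores are invented for illustration.

import random

def paired_bootstrap(scores_a, scores_b, n_resamples=2000, seed=0):
    """Fraction of resamples in which system A's mean exceeds system B's.
    scores_a and scores_b are per-segment scores over the same segments."""
    assert len(scores_a) == len(scores_b)
    rng = random.Random(seed)
    n = len(scores_a)
    wins = 0
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]  # resample with replacement
        mean_a = sum(scores_a[i] for i in idx) / n
        mean_b = sum(scores_b[i] for i in idx) / n
        if mean_a > mean_b:
            wins += 1
    return wins / n_resamples
```

A win fraction near 1.0 (or 0.0) indicates a difference that survives resampling; values near 0.5 mean the ranking between two systems is not stable at the segment level.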

Circularity Check

0 steps flagged

No significant circularity in empirical benchmarking study

full rationale

This paper is a pure empirical benchmarking study comparing 6 SpeechLLMs against 16 cascade and direct systems across 16 benchmarks, 13 language pairs, and 9 conditions. It contains no derivations, equations, fitted parameters, self-referential definitions, or load-bearing self-citations that reduce claims to inputs by construction. All results derive from direct evaluation of external model outputs on standard benchmarks, with no ansatzes, uniqueness theorems, or renamings of known results. The central claim about reliability of cascades versus SpeechLLMs is supported by the scale of the evaluation itself and does not rely on any internal circular step.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Empirical benchmarking paper with no mathematical derivations or new theoretical constructs; relies only on standard evaluation practices and existing models.

pith-pipeline@v0.9.0 · 5536 in / 1074 out tokens · 37602 ms · 2026-05-16T21:45:18.337930+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Pearmut: Human Evaluation of Translation Made Trivial

    cs.CL 2026-01 unverdicted novelty 5.0

    Pearmut is a platform that makes end-to-end human evaluation of translations as easy as automatic metrics by supporting DA, ESA, MQM and features like document context and attention checks.

Reference graph

Works this paper leans on

106 extracted references · 106 canonical work pages · cited by 1 Pith paper · 15 internal anchors


  3. [3]

    Idris Abdulmumin, Victor Agostinelli, Tanel Alumäe, Antonios Anastasopoulos, Luisa Bentivogli, Ondřej Bojar, Claudia Borg, Fethi Bougares, Roldano Cattoni, Mauro Cettolo, Lizhong Chen, William Chen, Raj Dabre, Yannick Estève, Marcello Federico, Mark Fishel, Marco Gaido, Dávid Javorský, Marek Kasztelnik, Fortuné Kponou, Mateusz Krubiński, ..., Katsuhito Sudoh, Brian Thompson, Marco Turchi, Alex Waibel, Patrick Wilken, Rodolfo Zevallos, Vilém Zouhar, and Maike Züfle

  4. [4]

    Abdelrahman Abouelenin, Atabak Ashfaq, Adam Atkinson, Hany Awadalla, Nguyen Bach, Jianmin Bao, Alon Benhaim, Martin Cai, Vishrav Chaudhary, Congcong Chen, et al. 2025. Phi-4-mini technical report: Compact yet powerful multimodal language models via mixture-of-loras. arXiv preprint arXiv:2503.01743

  5. [5]

    Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. 2023. GPT-4 technical report. arXiv preprint arXiv:2303.08774

  6. [6]

    Ibrahim Said Ahmad, Antonios Anastasopoulos, Ondřej Bojar, Claudia Borg, Marine Carpuat, Roldano Cattoni, Mauro Cettolo, William Chen, Qianqian Dong, Marcello Federico, Barry Haddow, Dávid Javorský, Mateusz Krubiński, Tsz Kin Lam, Xutai Ma, Prashant Mathur, Evgeny Matusov, Chandresh Maurya, John McCrae, Kenton Murray, Satoshi Nakamura, Matte...

  7. [7]

    Joshua Ainslie, James Lee-Thorp, Michiel de Jong, Yury Zemlyanskiy, Federico Lebron, and Sumit Sanghai. 2023. https://doi.org/10.18653/v1/2023.emnlp-main.298 GQA: Training generalized multi-query transformer models from multi-head checkpoints. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 4895--4901. As...

  8. [8]

    Duarte Miguel Alves, José Pombal, Nuno M Guerreiro, Pedro Henrique Martins, João Alves, Amin Farajian, Ben Peters, Ricardo Rei, Patrick Fernandes, Sweta Agrawal, Pierre Colombo, José G. C. de Souza, and Andre Martins. 2024. https://openreview.net/forum?id=EHPns3hVkj Tower: An open multilingual large language model for translation-related tasks...

  9. [9]

    Kshitij Ambilduke, Ben Peters, Sonal Sannigrahi, Anil Keshwani, Tsz Kin Lam, Bruno Martins, André F. T. Martins, and Marcely Zanon Boito. 2025. http://arxiv.org/abs/2503.10620 From Tower to Spire: Adding the speech modality to a text-only LLM

  10. [10]

    Ebrahim Ansari, Amittai Axelrod, Nguyen Bach, Ondřej Bojar, Roldano Cattoni, Fahim Dalvi, Nadir Durrani, Marcello Federico, Christian Federmann, Jiatao Gu, Fei Huang, Kevin Knight, Xutai Ma, Ajay Nagesh, Matteo Negri, Jan Niehues, Juan Pino, Elizabeth Salesky, Xing Shi, Sebastian Stüker, Marco Turchi, Alexander Waibel, and Changhan Wang. 2020. https...

  11. [11]

    Mohamed Anwar, Bowen Shi, Vedanuj Goswami, Wei-Ning Hsu, Juan Pino, and Changhan Wang. 2023. https://doi.org/10.21437/Interspeech.2023-2279 MuAViC: A multilingual audio-visual corpus for robust speech recognition and robust speech-to-text translation. In 24th Annual Conference of the International Speech Communication Association, Interspeech 2023, ...

  12. [12]

    Rosana Ardila, Megan Branson, Kelly Davis, Michael Kohler, Josh Meyer, Michael Henretty, Reuben Morais, Lindsay Saunders, Francis Tyers, and Gregor Weber. 2020. https://aclanthology.org/2020.lrec-1.520/ Common voice: A massively-multilingual speech corpus . In Proceedings of the Twelfth Language Resources and Evaluation Conference, pages 4218--4222. Europ...

  13. [13]

    Giuseppe Attanasio, Beatrice Savoldi, Dennis Fucci, and Dirk Hovy. 2024. https://doi.org/10.18653/v1/2024.emnlp-main.1188 Twists, humps, and pebbles: Multilingual speech recognition models exhibit gender performance gaps . In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 21318--21340. Association for Computa...

  14. [14]

    Alexei Baevski, Yuhao Zhou, Abdelrahman Mohamed, and Michael Auli. 2020. wav2vec 2.0: A framework for self-supervised learning of speech representations. Advances in neural information processing systems, 33:12449--12460

  15. [15]

    Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang, et al. 2023. Qwen technical report. arXiv preprint arXiv:2309.16609

  16. [16]

    Ankur Bapna, Colin Cherry, Yu Zhang, Ye Jia, Melvin Johnson, Yong Cheng, Simran Khanuja, Jason Riesa, and Alexis Conneau. 2022. mSLAM: Massively multilingual joint pre-training for speech and text. arXiv preprint arXiv:2202.01374

  17. [17]

    Loïc Barrault, Yu-An Chung, Mariano Cora Meglioli, David Dale, Ning Dong, Paul-Ambroise Duquenne, Hady Elsahar, Hongyu Gong, Kevin Heffernan, John Hoffman, et al. 2023. SeamlessM4T: Massively multilingual & multimodal machine translation. arXiv preprint arXiv:2308.11596

  18. [18]

    Rachel Bawden and François Yvon. 2023. https://aclanthology.org/2023.eamt-1.16/ Investigating the translation performance of a large multilingual language model: the case of BLOOM. In Proceedings of the 24th Annual Conference of the European Association for Machine Translation, pages 157--170. European Association for Machine Translation

  19. [19]

    Luisa Bentivogli, Mauro Cettolo, Marco Gaido, Alina Karakanta, Alberto Martinelli, Matteo Negri, and Marco Turchi. 2021. https://doi.org/10.18653/v1/2021.acl-long.224 Cascade versus direct speech translation: Do the differences still make a difference? In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th ...

  20. [20]

    Alexandre Bérard, Olivier Pietquin, Christophe Servan, and Laurent Besacier. 2016. Listen and translate: A proof of concept for end-to-end speech-to-text translation. arXiv preprint arXiv:1612.01744

  21. [21]

    Yunfei Chu, Jin Xu, Qian Yang, Haojie Wei, Xipin Wei, Zhifang Guo, Yichong Leng, Yuanjun Lv, Jinzheng He, Junyang Lin, et al. 2024. Qwen2-audio technical report. arXiv preprint arXiv:2407.10759

  22. [22]

    Alexis Conneau, Min Ma, Simran Khanuja, Yu Zhang, Vera Axelrod, Siddharth Dalmia, Jason Riesa, Clara Rivera, and Ankur Bapna. 2022. https://arxiv.org/abs/2205.12446 FLEURS: Few-shot learning evaluation of universal representations of speech. arXiv preprint arXiv:2205.12446

  23. [23]

    Marta R. Costa-jussà, Christine Basta, and Gerard I. Gállego. 2022. https://aclanthology.org/2022.lrec-1.230/ Evaluating gender bias in speech translation. In Proceedings of the Thirteenth Language Resources and Evaluation Conference, pages 2141--2147. European Language Resources Association

  24. [24]

    Marta R. Costa-jussà, James Cross, Onur Çelebi, Maha Elbayad, Kenneth Heafield, Kevin Heffernan, Elahe Kalbassi, Janice Lam, Daniel Licht, Jean Maillard, et al. 2022. No language left behind: Scaling human-centered machine translation. arXiv preprint arXiv:2207.04672

  25. [25]

    John Dang, Shivalika Singh, Daniel D'souza, Arash Ahmadian, Alejandro Salamanca, Madeline Smith, Aidan Peppin, Sungjin Hong, Manoj Govindassamy, Terrence Zhao, Sandra Kublik, Meor Amer, Viraat Aryabumi, Jon Ander Campos, Yi-Chern Tan, Tom Kocmi, Florian Strub, Nathan Grinsztajn, Yannis Flet-Berliac, Acyr Locatelli, Hangyu Lin, Dwarak Talupuru, Bharat Venk...

  26. [26]

    Daniel Deutsch, Eleftheria Briakou, Isaac Rayburn Caswell, Mara Finkelstein, Rebecca Galor, Juraj Juraska, Geza Kovacs, Alison Lui, Ricardo Rei, Jason Riesa, Shruti Rijhwani, Parker Riley, Elizabeth Salesky, Firas Trabelsi, Stephanie Winkler, Biao Zhang, and Markus Freitag. 2025. https://doi.org/10.18653/v1/2025.findings-acl.634 WMT 24++: Expanding the la...

  27. [28]

    Markus Freitag, Nitika Mathur, Chi-kiu Lo, Eleftherios Avramidis, Ricardo Rei, Brian Thompson, Tom Kocmi, Frederic Blain, Daniel Deutsch, Craig Stewart, Chrysoula Zerva, Sheila Castilho, Alon Lavie, and George Foster. 2023. https://doi.org/10.18653/v1/2023.wmt-1.51 Results of WMT 23 metrics shared task: Metrics might be guilty but references are not innoc...

  28. [29]

    Marco Gaido, Sara Papi, Matteo Negri, and Luisa Bentivogli. 2024. https://doi.org/10.18653/v1/2024.acl-long.789 Speech translation with speech foundation models and large language models: What is there and what is missing? In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 14760--14778...

  29. [30]

    Marco Gaido, Susana Rodríguez, Matteo Negri, Luisa Bentivogli, and Marco Turchi. 2021. https://doi.org/10.18653/v1/2021.emnlp-main.128 Is "Moby Dick" a whale or a bird? Named entities and terminology in speech translation. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 1707--1716. Association for Co...

  30. [31]

    Xavier Garcia, Yamini Bansal, Colin Cherry, George Foster, Maxim Krikun, Melvin Johnson, and Orhan Firat. 2023. The unreasonable effectiveness of few-shot learning for machine translation. In International Conference on Machine Learning, pages 10867--10878. PMLR

  31. [32]

    Gemma Team, Aishwarya Kamath, Johan Ferret, Shreya Pathak, Nino Vieillard, Ramona Merhej, Sarah Perrin, Tatiana Matejovicova, Alexandre Ramé, Morgane Rivière, et al. 2025. Gemma 3 technical report. arXiv preprint arXiv:2503.19786

  32. [34]

    Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. 2024. The llama 3 herd of models. arXiv preprint arXiv:2407.21783

  33. [36]

    Anmol Gulati, James Qin, Chung-Cheng Chiu, Niki Parmar, Yu Zhang, Jiahui Yu, Wei Han, Shibo Wang, Zhengdong Zhang, Yonghui Wu, and Ruoming Pang. 2020. https://doi.org/10.21437/Interspeech.2020-3015 Conformer: Convolution-augmented transformer for speech recognition . In Interspeech 2020, pages 5036--5040

  34. [37]

    Shangmin Guo, Biao Zhang, Tianlin Liu, Tianqi Liu, Misha Khalman, Felipe Llinares, Alexandre Rame, Thomas Mesnard, Yao Zhao, Bilal Piot, et al. 2024. Direct language model alignment from online ai feedback. arXiv preprint arXiv:2402.04792

  35. [38]

    HyoJung Han, Kevin Duh, and Marine Carpuat. 2024. https://doi.org/10.18653/v1/2024.emnlp-main.1218 SpeechQE: Estimating the quality of direct speech translation. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 21852--21867. Association for Computational Linguistics

  36. [39]

    Zachary Hopton and Eleanor Chodroff. 2025. https://doi.org/10.18653/v1/2025.sigmorphon-main.3 The impact of dialect variation on robust automatic speech recognition for Catalan. In Proceedings of the 22nd SIGMORPHON Workshop on Computational Morphology, Phonology, and Phonetics, pages 23--33. Association for Computational Linguistics

  37. [40]

    Wei-Ning Hsu, Benjamin Bolte, Yao-Hung Hubert Tsai, Kushal Lakhotia, Ruslan Salakhutdinov, and Abdelrahman Mohamed. 2021. https://doi.org/10.1109/TASLP.2021.3122291 HuBERT: Self-supervised speech representation learning by masked prediction of hidden units. IEEE/ACM Trans. Audio, Speech and Lang. Proc., 29:3451--3460

  38. [41]

    Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. 2022. https://openreview.net/forum?id=nZeVKeeFYf9 LoRA: Low-rank adaptation of large language models. In International Conference on Learning Representations

  39. [42]

    Sathish Indurthi, Houjeung Han, Nikhil Kumar Lakumarapu, Beomseok Lee, Insoo Chung, Sangha Kim, and Chanwoo Kim. 2020. https://doi.org/10.1109/ICASSP40776.2020.9054759 End-end speech-to-text translation with modality agnostic meta-learning . In ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 7904--7908

  40. [43]

    J. Iranzo-Sánchez, J. A. Silvestre-Cerdà, J. Jorge, N. Roselló, A. Giménez, A. Sanchis, J. Civera, and A. Juan. 2020. Europarl-ST: A multilingual corpus for speech translation of parliamentary debates. In ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 8229--8233

  41. [44]

    Ye Jia, Yifan Ding, Ankur Bapna, Colin Cherry, Yu Zhang, Alexis Conneau, and Nobu Morioka. 2022. https://doi.org/10.21437/Interspeech.2022-10938 Leveraging unsupervised and weakly-supervised data to improve direct speech-to-speech translation. In Interspeech 2022, pages 1721--1725

  42. [45]

    Juraj Juraska, Daniel Deutsch, Mara Finkelstein, and Markus Freitag. 2024. https://doi.org/10.18653/v1/2024.wmt-1.35 MetricX-24: The Google submission to the WMT 2024 metrics shared task. In Proceedings of the Ninth Conference on Machine Translation, pages 492--504. Association for Computational Linguistics

  43. [46]

    Tom Kocmi, Ekaterina Artemova, Eleftherios Avramidis, Rachel Bawden, Ondřej Bojar, Konstantin Dranch, Anton Dvorkovich, Sergey Dukanov, Mark Fishel, Markus Freitag, Thamme Gowda, Roman Grundkiewicz, Barry Haddow, Marzena Karpinska, Philipp Koehn, Howard Lakougna, Jessica M. Lundin, Christof Monz, Kenton Murray, Masaaki Nagata, Stefano Perrella, Lorenzo ...

  44. [47]

    Tom Kocmi, Eleftherios Avramidis, Rachel Bawden, Ondřej Bojar, Anton Dvorkovich, Christian Federmann, Mark Fishel, Markus Freitag, Thamme Gowda, Roman Grundkiewicz, Barry Haddow, Marzena Karpinska, Philipp Koehn, Benjamin Marie, Christof Monz, Kenton Murray, Masaaki Nagata, Martin Popel, Maja Popović, Mariya Shmatova, Steinthór Steingrímsson...

  45. [48]

    Tom Kocmi, Vilém Zouhar, Eleftherios Avramidis, Roman Grundkiewicz, Marzena Karpinska, Maja Popović, Mrinmaya Sachan, and Mariya Shmatova. 2024b. https://doi.org/10.18653/v1/2024.wmt-1.131 Error span annotation: A balanced approach for human evaluation of machine translation. In Proceedings of the Ninth Conference on Machine Translation, pages ...

  46. [49]

    Tom Kocmi, Vilém Zouhar, Christian Federmann, and Matt Post. 2024c. https://doi.org/10.18653/v1/2024.acl-long.110 Navigating the metrics maze: Reconciling score magnitudes and accuracies. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1999--2014. Association for Computationa...

  47. [50]

    Sai Koneru, Maike Züfle, Thai Binh Nguyen, Seymanur Akti, Jan Niehues, and Alexander Waibel. 2025. https://doi.org/10.18653/v1/2025.iwslt-1.22 KIT's offline speech translation and instruction following submission for IWSLT 2025. In Proceedings of the 22nd International Conference on Spoken Language Translation (IWSLT 2025), pages 232--244. Associat...

  48. [51]

    Hadas Kotek, Rikker Dockum, and David Q. Sun. 2023. https://doi.org/10.1145/3582269.3615599 Gender bias and stereotypes in large language models . In Proceedings of The ACM Collective Intelligence Conference, CI 2023, Delft, Netherlands, November 6-9, 2023 , pages 12--24. ACM

  49. [52]

    Oleksii Kuchaiev, Jason Li, Huyen Nguyen, Oleksii Hrinchuk, Ryan Leary, Boris Ginsburg, Samuel Kriman, Stanislav Beliaev, Vitaly Lavrukhin, Jack Cook, et al. 2019. Nemo: a toolkit for building ai applications using neural modules. arXiv preprint arXiv:1909.09577

  50. [53]

    Taku Kudo and John Richardson. 2018. http://arxiv.org/abs/1808.06226 Sentencepiece: A simple and language independent subword tokenizer and detokenizer for neural text processing

  51. [54]

    Siddique Latif, Moazzam Shoukat, Fahad Shamshad, Muhammad Usama, Yi Ren, Heriberto Cuay \'a huitl, Wenwu Wang, Xulong Zhang, Roberto Togneri, Erik Cambria, et al. 2023. Sparks of large audio models: A survey and outlook. arXiv preprint arXiv:2308.12792

  52. [55]

    Alon Lavie, Greg Hanneman, Sweta Agrawal, Diptesh Kanojia, Chi-Kiu Lo, Vilém Zouhar, Frederic Blain, Chrysoula Zerva, Eleftherios Avramidis, Sourabh Deoghare, Archchana Sindhujan, Jiayi Wang, David Ifeoluwa Adelani, Brian Thompson, Tom Kocmi, Markus Freitag, and Daniel Deutsch. 2025. https://doi.org/10.18653/v1/2025.wmt-1.24 Findings of the WMT 25 sha...

  53. [56]

    Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. 2023. BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In International Conference on Machine Learning, pages 19730--19742. PMLR

  54. [57]

    Alexander H Liu, Andy Ehrenberg, Andy Lo, Clément Denoix, Corentin Barreau, Guillaume Lample, Jean-Malo Delignon, Khyathi Raghavi Chandu, Patrick von Platen, Pavankumar Reddy Muddireddy, et al. 2025. Voxtral. arXiv preprint arXiv:2507.13264

  55. [58]

    Liam Lonergan, Mengjie Qian, Neasa Ní Chiaráin, Christer Gobl, and Ailbhe Ní Chasaide. 2023. https://doi.org/10.21437/INTERSPEECH.2023-2528 Towards dialect-inclusive recognition in a low-resource language: Are balanced corpora the answer? In 24th Annual Conference of the International Speech Communication Association, Interspeech 2023, Dublin, Ir...

  56. [59]

    Ke-Han Lu, Zhehuai Chen, Szu-Wei Fu, Chao-Han Huck Yang, Jagadeesh Balam, Boris Ginsburg, Yu-Chiang Frank Wang, and Hung-yi Lee. 2024. DeSTA2: Developing instruction-following speech language model without speech instruction-tuning data. arXiv preprint arXiv:2409.20007

  57. [60]

    Dominik Macháček, Ondřej Bojar, and Raj Dabre. 2023. https://doi.org/10.18653/v1/2023.iwslt-1.12 MT metrics correlate with human ratings of simultaneous speech translation . In Proceedings of the 20th International Conference on Spoken Language Translation (IWSLT 2023), pages 169--179. Association for Computational Linguistics

  58. [61]

    Evgeny Matusov, Gregor Leusch, Oliver Bender, and Hermann Ney. 2005. https://aclanthology.org/2005.iwslt-1.19/ Evaluating machine translation output with automatic sentence segmentation . In Proceedings of the Second International Workshop on Spoken Language Translation

  59. [62]

    Anna Min, Chenxu Hu, Yi Ren, and Hang Zhao. 2025. When end-to-end is overkill: Rethinking cascaded speech-to-text translation. arXiv preprint arXiv:2502.00377

  60. [63]

    H. Ney. 1999. https://doi.org/10.1109/ICASSP.1999.758176 Speech translation: coupling of recognition and translation . In 1999 IEEE International Conference on Acoustics, Speech, and Signal Processing. Proceedings. ICASSP99 (Cat. No.99CH36258), volume 1, pages 517--520 vol.1

  61. [64]

    Thai-Son Nguyen, Sebastian Stüker, Jan Niehues, and Alex Waibel. 2020. https://doi.org/10.1109/ICASSP40776.2020.9054130 Improving sequence-to-sequence speech recognition training with on-the-fly data augmentation . In ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 7689--7693

  62. [65]

    Tu Anh Nguyen, Wei-Ning Hsu, Antony D'Avirro, Bowen Shi, Itai Gat, Maryam Fazel-Zarani, Tal Remez, Jade Copet, Gabriel Synnaeve, Michael Hassid, Felix Kreuk, Yossi Adi, and Emmanuel Dupoux. 2023. https://doi.org/10.21437/Interspeech.2023-1905 Expresso: A benchmark and analysis of discrete expressive speech resynthesis . In Interspeech 2023, pages 4823--4827

  63. [67]

    Tu Anh Nguyen, Benjamin Muller, Bokai Yu, Marta R. Costa-jussa, Maha Elbayad, Sravya Popuri, Christophe Ropers, Paul-Ambroise Duquenne, Robin Algayres, Ruslan Mavlyutov, Itai Gat, Mary Williamson, Gabriel Synnaeve, Juan Pino, Benoît Sagot, and Emmanuel Dupoux. 2025b. https://doi.org/10.1162/tacl_a_00728 SpiRit-LM: Interleaved spoken and written language...

  64. [68]

    Vassil Panayotov, Guoguo Chen, Daniel Povey, and Sanjeev Khudanpur. 2015. https://doi.org/10.1109/ICASSP.2015.7178964 LibriSpeech: An ASR corpus based on public domain audio books . In 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 5206--5210

  65. [70]

    Sara Papi, Peter Polák, Dominik Macháček, and Ondřej Bojar. 2025a. https://doi.org/10.1162/tacl_a_00740 How "real" is your real-time simultaneous speech-to-text translation system? Transactions of the Association for Computational Linguistics, 13:281--313

  66. [71]

    Sara Papi, Maike Züfle, Marco Gaido, Beatrice Savoldi, Danni Liu, Ioannis Douros, Luisa Bentivogli, and Jan Niehues. 2025b. http://arxiv.org/abs/2507.19634 MCIF: Multimodal crosslingual instruction-following benchmark from scientific talks

  67. [72]

    Yifan Peng, Muhammad Shakeel, Yui Sudo, William Chen, Jinchuan Tian, Chyi-Jiunn Lin, and Shinji Watanabe. 2025. https://doi.org/10.21437/Interspeech.2025-1062 OWSM v4: Improving Open Whisper-Style Speech Models via Data Scaling and Cleaning . In Interspeech 2025, pages 2225--2229

  68. [73]

    Juan Pino, Liezl Puzon, Jiatao Gu, Xutai Ma, Arya D. McCarthy, and Deepak Gopinath. 2019. https://aclanthology.org/2019.iwslt-1.18/ Harnessing indirect training data for end-to-end automatic speech translation: Tricks of the trade . In Proceedings of the 16th International Conference on Spoken Language Translation. Association for Computational Linguistics

  69. [74]

    Alec Radford, Jong Wook Kim, Tao Xu, Greg Brockman, Christine McLeavey, and Ilya Sutskever. 2023. Robust speech recognition via large-scale weak supervision. In International conference on machine learning, pages 28492--28518. PMLR

  70. [75]

    Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D Manning, Stefano Ermon, and Chelsea Finn. 2023. Direct preference optimization: Your language model is secretly a reward model. Advances in neural information processing systems, 36:53728--53741

  71. [76]

    Nithin Rao Koluguri, Monica Sekoyan, George Zelenfroynd, Sasha Meister, Shuoyang Ding, Sofia Kostandian, He Huang, Nikolay Karpov, Jagadeesh Balam, Vitaly Lavrukhin, Yifan Peng, Sara Papi, Marco Gaido, Alessio Brutti, and Boris Ginsburg. 2025. https://doi.org/10.21437/Interspeech.2025-540 Granary: Speech Recognition and Translation Dataset in 25 European...

  72. [77]

    Ricardo Rei, Nuno M Guerreiro, José Pombal, João Alves, Pedro Teixeirinha, Amin Farajian, and André F. T. Martins. 2025. Tower+: Bridging generality and translation specialization in multilingual LLMs. arXiv preprint arXiv:2506.17080

  73. [78]

    Dima Rekesh, Samuel Kriman, Somshubra Majumdar, Vahid Noroozi, He Huang, Oleksii Hrinchuk, Ankur Kumar, and Boris Ginsburg. 2023. Fast conformer with linearly scalable attention for efficient speech recognition. 2023 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), pages 1--8

  74. [79]

    Elijah Rippeth, Sweta Agrawal, and Marine Carpuat. 2022. https://doi.org/10.18653/v1/2022.iwslt-1.30 Controlling translation formality using pre-trained multilingual language models . In Proceedings of the 19th International Conference on Spoken Language Translation (IWSLT 2022), pages 327--340. Association for Computational Linguistics

  75. [80]

    Paul K Rubenstein, Chulayuth Asawaroengchai, Duc Dung Nguyen, Ankur Bapna, Zalán Borsos, Félix de Chaumont Quitry, Peter Chen, Dalia El Badawy, Wei Han, Eugene Kharitonov, et al. 2023. AudioPaLM: A large language model that can speak and listen. arXiv preprint arXiv:2306.12925

  76. [81]

    Elizabeth Salesky, Kareem Darwish, Mohamed Al-Badrashiny, Mona Diab, and Jan Niehues. 2023. https://doi.org/10.18653/v1/2023.iwslt-1.2 Evaluating multilingual speech translation under realistic conditions with resegmentation and terminology . In Proceedings of the 20th International Conference on Spoken Language Translation (IWSLT 2023), pages 62--78. Ass...

  77. [82]

    Mohammad Hossein Sameti, Sepehr Harfi Moridani, Ali Zarean, and Hossein Sameti. 2025. https://doi.org/10.48550/arXiv.2510.09528 Accent-invariant automatic speech recognition via saliency-driven spectrogram masking

  78. [83]

    Gabriele Sarti, Phu Mon Htut, Xing Niu, Benjamin Hsu, Anna Currey, Georgiana Dinu, and Maria Nadejde. 2023. https://doi.org/10.18653/v1/2023.acl-short.126 RAMP: Retrieval and attribute-marking enhanced prompting for attribute-controlled translation . In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 2: Sho...

  79. [84]

    Beatrice Savoldi, Jasmijn Bastings, Luisa Bentivogli, and Eva Vanmassenhove. 2025. https://doi.org/10.1016/J.PATTER.2025.101257 A decade of gender bias in machine translation . Patterns, 6(7):101257

  80. [85]

    Björn W. Schuller. 2018. https://doi.org/10.1145/3129340 Speech emotion recognition: two decades in a nutshell, benchmarks, and ongoing trends . Commun. ACM, 61(5):90–99

Showing first 80 references.