Recognition: 2 Lean theorem links
Hearing to Translate: The Effectiveness of Speech Modality Integration into LLMs
Pith reviewed 2026-05-16 21:45 UTC · model grok-4.3
The pith
Cascaded systems remain the most reliable for speech translation overall, though recent SpeechLLMs can match or outperform them in various settings.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Across this extensive evaluation, we find that cascaded systems remain the most reliable solution overall, but most recent SpeechLLMs can match or even outperform cascades in various settings while SFMs lag behind both, highlighting that integrating an LLM, either within the model or in a pipeline, is essential for high-quality speech translation.
What carries the argument
Large-scale benchmarking of six SpeechLLMs against sixteen strong direct and cascaded systems that couple speech foundation models (SFMs) with multilingual LLMs, across sixteen benchmarks, thirteen language pairs, and nine conditions including noise and disfluency.
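To make the comparison concrete, here is a minimal sketch of the two architectures being benchmarked. The function names and injected callables are hypothetical stand-ins, not the paper's actual model stack.

```python
from typing import Callable

def cascade_st(audio: bytes, src: str, tgt: str,
               transcribe: Callable[[bytes, str], str],
               translate: Callable[[str, str, str], str]) -> str:
    """Cascaded ST: a speech foundation model (SFM) transcribes,
    then a multilingual text LLM translates the transcript."""
    transcript = transcribe(audio, src)      # SFM acting as ASR
    return translate(transcript, src, tgt)   # text-only LLM

def direct_st(audio: bytes, src: str, tgt: str,
              speechllm: Callable[[bytes, str, str], str]) -> str:
    """Direct ST: a SpeechLLM maps audio straight to target-language
    text, with no intermediate transcript to propagate ASR errors."""
    return speechllm(audio, src, tgt)
```

The trade-off the paper probes is visible even in this skeleton: a cascade can swap in a stronger LLM independently of the ASR stage, while a direct model never commits to a possibly wrong transcript.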
If this is right
- Cascaded systems should be the default choice when reliability across varied inputs is the priority.
- Recent SpeechLLMs become practical alternatives in the specific language pairs or speech conditions where they perform on par with or better than cascades.
- Pure speech foundation models without LLM components are insufficient for competitive translation quality.
- Any effective speech translation approach requires LLM capabilities, either embedded directly or added in a pipeline.
Where Pith is reading between the lines
- Hybrid designs that feed speech features into cascaded LLM pipelines could combine the strengths of both approaches.
- The performance patterns suggest modular cascaded systems may retain an edge for consistency on highly variable real-world audio.
- Similar direct-versus-cascaded comparisons could be run for other speech tasks such as summarization or question answering.
- Practitioners may need condition-specific testing rather than a single architecture choice for production speech translation.
Load-bearing premise
The sixteen benchmarks, thirteen language pairs, nine conditions, and chosen systems adequately represent real-world speech translation challenges and current state-of-the-art methods.
What would settle it
A follow-up study on an expanded set of benchmarks or conditions where cascaded systems consistently outperform all tested SpeechLLMs would undermine the claim that recent SpeechLLMs can match or exceed cascades in various settings.
Original abstract
As Large Language Models (LLMs) expand beyond text, integrating speech as a native modality has given rise to SpeechLLMs, which directly process spoken language and enable speech-to-text translation (ST) and other downstream tasks, bypassing traditional transcription-based pipelines. Whether this integration improves ST quality over established cascaded architectures, however, remains an open question. We present Hearing to Translate, the first comprehensive test suite rigorously benchmarking 6 state-of-the-art SpeechLLMs against 16 strong direct and cascade systems that couple leading speech foundation models (SFM), with multilingual LLMs. Our analysis spans 16 benchmarks, 13 language pairs, and 9 challenging conditions, including disfluent, noisy, and long-form speech. Across this extensive evaluation, we find that cascaded systems remain the most reliable solution overall, but most recent SpeechLLMs can match or even outperform cascades in various settings while SFMs lag behind both, highlighting that integrating an LLM, either within the model or in a pipeline, is essential for high-quality speech translation.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper conducts a large-scale empirical benchmarking of 6 state-of-the-art SpeechLLMs against 16 cascaded and direct systems (pairing speech foundation models with multilingual LLMs) for speech-to-text translation. Evaluation spans 16 benchmarks, 13 language pairs, and 9 conditions including disfluent, noisy, and long-form speech. The central finding is that cascaded systems remain most reliable overall, recent SpeechLLMs can match or exceed them in select settings, SFMs lag behind, and LLM integration (either native or in pipeline) is essential for high-quality performance.
Significance. If the comparative rankings hold under scrutiny, the work supplies a valuable, broad empirical reference point for the speech-translation community on when native speech-modality integration in LLMs is advantageous versus when established cascades remain preferable. The explicit coverage of challenging acoustic and linguistic conditions strengthens the practical relevance of the conclusions.
major comments (3)
- [Abstract, §4 (Evaluation setup)] The headline claim that 'cascaded systems remain the most reliable solution overall' is not accompanied by an explicit aggregate metric (e.g., mean rank or macro-averaged BLEU across all 16 benchmarks and 9 conditions) or by a statistical test establishing the significance of the observed differences; without these, the 'most reliable' ranking rests on a qualitative summary rather than a verifiable quantitative criterion. A minimal aggregation of this kind is sketched after this list.
- [§3.2 (Model selection), §4.3 (Results)] The assertion that the chosen 6 SpeechLLMs and 16 cascade combinations represent the state of the art is load-bearing for the generalizability of the ranking, yet the paper provides no explicit inclusion criteria, recency cutoff, or ablation showing that omitted recent systems would not alter the relative ordering.
- [§4.1 (Benchmarks and conditions)] While 16 benchmarks and 9 conditions are enumerated, the manuscript does not report whether the same train/dev/test splits were used uniformly across all systems, or whether any benchmark overlap exists that could inflate apparent performance gaps between cascaded and end-to-end models.
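As a gesture toward the aggregate metric requested in the first comment, a mean-rank computation could look like the following minimal sketch; the benchmark names, system names, and scores are invented for illustration.

```python
import statistics

def mean_ranks(scores: dict[str, dict[str, float]]) -> dict[str, float]:
    """Average rank of each system across benchmarks (1 = best).

    `scores` maps benchmark -> {system: score}, higher is better
    (e.g. BLEU or COMET). Ties are broken arbitrarily in this sketch.
    """
    ranks: dict[str, list[int]] = {}
    for per_system in scores.values():
        ordered = sorted(per_system, key=per_system.get, reverse=True)
        for rank, system in enumerate(ordered, start=1):
            ranks.setdefault(system, []).append(rank)
    return {s: statistics.mean(r) for s, r in ranks.items()}

example = {
    "bench_A": {"cascade_1": 31.2, "speechllm_1": 30.8, "sfm_1": 27.5},
    "bench_B": {"cascade_1": 24.9, "speechllm_1": 25.3, "sfm_1": 22.1},
}
print(mean_ranks(example))
# {'cascade_1': 1.5, 'speechllm_1': 1.5, 'sfm_1': 3.0}
```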
minor comments (3)
- [Throughout] Ensure every acronym (SFM, ST, SpeechLLM, etc.) is defined at first use in the main text, not only in the abstract.
- [§4.3] Tables reporting per-condition scores should include standard deviations or confidence intervals so readers can assess the stability of the reported differences; one standard recipe is sketched after this list.
- [§6] The paper would benefit from a short paragraph in the conclusion explicitly listing the precise conditions under which SpeechLLMs outperform cascades, rather than leaving this to the reader to extract from the tables.
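For the confidence intervals requested above, a percentile bootstrap over segment-level scores is one standard option. This sketch assumes per-segment scores are available, which the paper's tables do not state.

```python
import random
import statistics

def bootstrap_ci(seg_scores: list[float], n_resamples: int = 1000,
                 alpha: float = 0.05, seed: int = 0) -> tuple[float, float]:
    """Percentile bootstrap CI for the mean segment-level score of
    one system on one benchmark condition."""
    rng = random.Random(seed)
    n = len(seg_scores)
    means = sorted(
        statistics.mean(rng.choices(seg_scores, k=n))  # resample with replacement
        for _ in range(n_resamples)
    )
    lo = means[int(n_resamples * alpha / 2)]
    hi = means[int(n_resamples * (1 - alpha / 2)) - 1]
    return lo, hi
```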
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We address each major comment below and will incorporate revisions to enhance the quantitative rigor and transparency of the manuscript.
Point-by-point responses
Referee: [Abstract, §4 (Evaluation setup)] The headline claim that 'cascaded systems remain the most reliable solution overall' is not accompanied by an explicit aggregate metric (e.g., mean rank or macro-averaged BLEU across all 16 benchmarks and 9 conditions) or by a statistical test establishing the significance of the observed differences; without these, the 'most reliable' ranking rests on a qualitative summary rather than a verifiable quantitative criterion.
Authors: We agree that an explicit aggregate metric would make the claim more verifiable. In the revision we will add a table of average ranks across all benchmarks and conditions, along with a brief discussion of why formal significance testing is difficult given benchmark heterogeneity. The conclusion will be qualified accordingly while retaining the core observation that cascades lead in the majority of settings. (Revision: yes)
Referee: [§3.2 (Model selection), §4.3 (Results)] The assertion that the chosen 6 SpeechLLMs and 16 cascade combinations represent the state of the art is load-bearing for the generalizability of the ranking, yet the paper provides no explicit inclusion criteria, recency cutoff, or ablation showing that omitted recent systems would not alter the relative ordering.
Authors: Selection was based on models published or released in 2024 with publicly reported speech-translation results at the time of experimentation. We will add explicit criteria in §3.2 (post-2023 publication, open weights or API availability, and coverage of both native SpeechLLM and cascade paradigms) and a short limitations paragraph noting that exhaustive ablation of every concurrent system is infeasible but that the chosen set spans the dominant architectural families. (Revision: yes)
Referee: [§4.1 (Benchmarks and conditions)] While 16 benchmarks and 9 conditions are enumerated, the manuscript does not report whether the same train/dev/test splits were used uniformly across all systems, or whether any benchmark overlap exists that could inflate apparent performance gaps between cascaded and end-to-end models.
Authors: All models were evaluated zero-shot on the official test partitions of each benchmark; no training or dev-set adaptation was performed for any system. We will insert a clarifying paragraph in §4.1 that lists the exact splits used and confirms the absence of any training-data overlap or leakage that could favor one paradigm. (Revision: yes)
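A minimal version of the leakage audit the authors promise could proceed by exact-match hashing of normalized transcripts. This is the simplest possible check, not necessarily the one performed, and all names below are hypothetical.

```python
import hashlib

def normalized_hash(text: str) -> str:
    # Lowercase and collapse whitespace so trivial formatting
    # differences do not hide an overlap.
    norm = " ".join(text.lower().split())
    return hashlib.sha256(norm.encode("utf-8")).hexdigest()

def test_train_overlap(train_texts: list[str],
                       test_texts: list[str]) -> list[str]:
    """Return test transcripts that also occur verbatim (after
    normalization) in the training data."""
    train_hashes = {normalized_hash(t) for t in train_texts}
    return [t for t in test_texts if normalized_hash(t) in train_hashes]
```

Near-duplicate detection (e.g. n-gram overlap) would additionally catch paraphrased leakage that exact matching misses.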
Circularity Check
No significant circularity in empirical benchmarking study
Full rationale
This paper is a pure empirical benchmarking study comparing 6 SpeechLLMs against 16 cascade and direct systems across 16 benchmarks, 13 language pairs, and 9 conditions. It contains no derivations, equations, fitted parameters, self-referential definitions, or load-bearing self-citations that reduce claims to inputs by construction. All results derive from direct evaluation of external model outputs on standard benchmarks, with no ansatzes, uniqueness theorems, or renamings of known results. The central claim about reliability of cascades versus SpeechLLMs is supported by the scale of the evaluation itself and does not rely on any internal circular step.
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction (unclear)
  Relation between the paper passage and the cited Recognition theorem is unclear.
  Passage: "Across this extensive evaluation, we find that cascaded systems remain the most reliable solution overall, but most recent SpeechLLMs can match or even outperform cascades in various settings while SFMs lag behind both"
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel (unclear)
  Relation between the paper passage and the cited Recognition theorem is unclear.
  Passage: "We present Hearing to Translate, the first comprehensive test suite rigorously benchmarking 6 state-of-the-art SpeechLLMs against 16 strong direct and cascade systems"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 1 Pith paper
- Pearmut: Human Evaluation of Translation Made Trivial
  Pearmut is a platform that makes end-to-end human evaluation of translations as easy as automatic metrics by supporting DA, ESA, MQM and features like document context and attention checks.