Multilingual Long-Form Speech Instruction Following: KIT's Submission to IWSLT 2026

Alexander Waibel; Enes Yavuz Ugan; Fabian Retkowski; Jan Niehues; Maike Züfle; Seymanur Akti; Supriti Sinhamahapatra; Yuka Ko

arxiv: 2606.04730 · v1 · pith:YYWTWFFDnew · submitted 2026-06-03 · 💻 cs.CL · eess.AS

Enes Yavuz Ugan, Maike Züfle, Yuka Ko, Supriti Sinhamahapatra, Fabian Retkowski, Seymanur Akti, Jan Niehues, Alexander Waibel

Pith reviewed 2026-06-28 06:19 UTC · model grok-4.3

keywords long-form speech · instruction following · data augmentation · minimum Bayes risk · likelihood re-ranking · multilingual ASR · semantic tasks · IWSLT

The pith

Likelihood re-ranking degrades semantic performance in long-form speech by favoring segmented-audio candidates, but Minimum Bayes Risk decoding corrects the failure.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper describes a data augmentation pipeline that turns short-form speech corpora into over one million long-form training examples by concatenating segments, generating labels with an LLM, and translating across languages to cover six tasks in four languages. It then demonstrates that pure likelihood-based re-ranking, which works well for ASR, harms semantic instruction-following tasks because it prefers candidates produced from short audio segments rather than from full long-form processing. Combining the likelihood scores with Minimum Bayes Risk decoding avoids this selection bias and restores performance on the semantic tasks. The work is presented as KIT's entry in the unconstrained track of the IWSLT 2026 Instruction Following evaluation, which includes an unknown surprise task.

Core claim

A general data augmentation pipeline converts short-form corpora into long-form training data through segment concatenation, LLM-based label generation, and cross-lingual translation, yielding over 1M instances across six tasks and four languages. Likelihood-based re-ranking, while highly effective for ASR, systematically degrades semantic tasks by spuriously selecting candidates generated from segmented audio processing rather than holistic long-form inference; this failure mode is resolved by combining likelihood with Minimum Bayes Risk decoding.
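The three pipeline steps named above (segment concatenation, LLM-based label generation, cross-lingual translation) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the `Segment` and `LongFormExample` types, the greedy packing by a target duration, and the `labeler` and `translator` callables are all assumptions introduced here. The callables stand in for whatever LLM and translation systems a real pipeline would call.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Optional


@dataclass
class Segment:
    audio_path: str    # path to a short-form clip (hypothetical field)
    transcript: str    # reference transcript of the clip
    duration_s: float  # clip duration in seconds


@dataclass
class LongFormExample:
    audio_paths: List[str]  # clips to concatenate, in order
    transcript: str         # concatenated transcript
    task: str               # task name, e.g. "SSUM"
    label: str              # target text produced by the labeler


def group_segments(segments: Iterable[Segment], target_s: float) -> List[List[Segment]]:
    """Greedily pack consecutive segments until each group reaches target_s seconds."""
    groups: List[List[Segment]] = []
    current: List[Segment] = []
    total = 0.0
    for seg in segments:
        current.append(seg)
        total += seg.duration_s
        if total >= target_s:
            groups.append(current)
            current, total = [], 0.0
    if current:  # keep the shorter trailing group
        groups.append(current)
    return groups


def build_long_form(
    segments: Iterable[Segment],
    target_s: float,
    task: str,
    labeler: Callable[[str, str], str],
    translator: Optional[Callable[[str], str]] = None,
) -> List[LongFormExample]:
    """Concatenate segments, label each group, and optionally translate the label."""
    examples = []
    for group in group_segments(segments, target_s):
        transcript = " ".join(s.transcript for s in group)
        label = labeler(task, transcript)  # stand-in for an LLM call
        if translator is not None:
            label = translator(label)      # stand-in for a translation system
        examples.append(
            LongFormExample([s.audio_path for s in group], transcript, task, label)
        )
    return examples
```

Passing the labeler and translator as plain callables keeps the sketch independent of any particular model API, which is not specified in the text above.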

What carries the argument

Likelihood re-ranking combined with Minimum Bayes Risk decoding, which prevents selection of segmented-audio candidates on semantic tasks.
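As a rough illustration of how such a combination could work, the sketch below scores each candidate by a weighted sum of its normalized log-likelihood and its Minimum Bayes Risk utility, meaning its average agreement with the other candidates. The interpolation weight, the min-max normalization, and the toy unigram-F1 utility are assumptions made for this example; the page does not state how the paper combines the two scores or which utility metric it uses.

```python
from typing import Callable, List, Sequence


def token_f1(hyp: str, ref: str) -> float:
    """Toy utility: unigram F1 over whitespace tokens (stand-in for a real metric)."""
    h, r = hyp.split(), ref.split()
    if not h or not r:
        return 0.0
    remaining = list(r)
    overlap = 0
    for tok in h:
        if tok in remaining:
            remaining.remove(tok)  # count each reference token at most once
            overlap += 1
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(h), overlap / len(r)
    return 2 * precision * recall / (precision + recall)


def mbr_scores(cands: Sequence[str], utility: Callable[[str, str], float] = token_f1) -> List[float]:
    """Average utility of each candidate against all other candidates."""
    n = len(cands)
    if n < 2:
        return [0.0] * n
    return [
        sum(utility(c, other) for j, other in enumerate(cands) if j != i) / (n - 1)
        for i, c in enumerate(cands)
    ]


def min_max(xs: Sequence[float]) -> List[float]:
    """Rescale scores to [0, 1]; constant inputs map to 0.5."""
    lo, hi = min(xs), max(xs)
    if hi == lo:
        return [0.5] * len(xs)
    return [(x - lo) / (hi - lo) for x in xs]


def select(cands: Sequence[str], logprobs: Sequence[float], weight: float = 0.5) -> int:
    """Index of the candidate maximizing weight * likelihood + (1 - weight) * MBR."""
    if not cands or len(cands) != len(logprobs):
        raise ValueError("need one log-probability per candidate")
    lik = min_max(logprobs)
    mbr = min_max(mbr_scores(cands))
    combined = [weight * l + (1 - weight) * m for l, m in zip(lik, mbr)]
    return max(range(len(cands)), key=combined.__getitem__)
```

With `weight=1.0` the function reduces to pure likelihood re-ranking, and with `weight=0.0` to plain MBR selection, so the same candidate pool can be scored both ways for comparison.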

If this is right

The same augmentation pipeline can be applied to additional languages and tasks without task-specific redesign.
Minimum Bayes Risk decoding becomes necessary whenever re-ranking is used on semantic rather than purely acoustic objectives.
Systems can be prepared for unknown surprise tasks by training only on the generated long-form mixture rather than on known task distributions.
The observed failure of likelihood re-ranking is specific to long-form inputs and does not appear in short-form ASR settings.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same selection bias may appear in any pipeline that mixes short-segment and full-context decoding paths, suggesting a general need for risk-based decoding in long-context speech systems.
If the augmented data distribution proves sufficiently close to real long-form speech, the approach could reduce reliance on scarce naturally occurring long recordings for training.
The result implies that future instruction-following benchmarks should include explicit long-form audio to expose re-ranking artifacts that short-form tests miss.

Load-bearing premise

Converting short-form corpora into long-form training data via segment concatenation, LLM-based label generation, and cross-lingual translation produces examples whose distribution is close enough to real long-form inputs that models trained on them will generalize to genuine long-form instruction following.

What would settle it

A controlled test on held-out real long-form audio showing that pure likelihood re-ranking selects segmented candidates at the same or lower rate than holistic candidates on semantic tasks, or that the augmented training data yields no gain over short-form-only training on genuine long-form inputs.
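The first half of that test comes down to counting how often the top-likelihood candidate originates from each decoding path and comparing that to each path's share of the candidate pool. A minimal sketch follows, assuming every candidate carries a `source` tag such as "segmented" or "holistic". The `Candidate` type and the tag names are hypothetical and introduced only for this example.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Candidate:
    text: str
    source: str     # hypothetical tag: "segmented" or "holistic"
    logprob: float  # log-likelihood assigned by the scoring model


def selection_rates(pools: List[List[Candidate]]) -> Dict[str, float]:
    """Fraction of non-empty pools whose top-likelihood candidate comes from each source."""
    wins = Counter(max(pool, key=lambda c: c.logprob).source for pool in pools if pool)
    total = sum(wins.values())
    return {src: n / total for src, n in wins.items()} if total else {}


def pool_shares(pools: List[List[Candidate]]) -> Dict[str, float]:
    """Fraction of all candidates contributed by each source (the chance baseline)."""
    counts = Counter(c.source for pool in pools for c in pool)
    total = sum(counts.values())
    return {src: n / total for src, n in counts.items()} if total else {}
```

A selection rate for "segmented" well above its pool share, on semantic tasks only, would be consistent with the bias the paper describes. A rate at or below the share would count against it.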

Original abstract

With the advent of Large Language Models, single-task and token-based multi-task models have evolved into instruction-based systems that infer task and target language implicitly from natural language prompts. This trend is reflected in IWSLT's Instruction Following Track, which this year introduced new tasks including an unknown surprise task, posing a genuine challenge against overfitting to known tasks. We present KIT's submission to the Long and Short Instruction Following tracks in the unconstrained setting. Our approach combines a general data augmentation pipeline that converts short-form corpora into long-form training data through segment concatenation, LLM-based label generation, and cross-lingual translation, yielding over 1M instances across six tasks and four languages. We further show that likelihood-based re-ranking, while highly effective for ASR, systematically degrades semantic tasks by spuriously selecting candidates generated from segmented audio processing rather than holistic long-form inference, a failure mode resolved by combining likelihood with Minimum Bayes Risk decoding.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

This is a standard IWSLT system paper that flags a re-ranking pitfall on semantic tasks but leaves its main data-augmentation claim untested.

The punchline is that this describes an engineering submission rather than a new scientific result. The authors scale short-form data into over a million long-form examples via segment concatenation, LLM label generation, and cross-lingual translation, then report that likelihood re-ranking hurts semantic instruction following because it prefers segmented-audio candidates over holistic ones; they mitigate it by combining likelihood with Minimum Bayes Risk decoding.

The re-ranking observation is the clearest practical takeaway. It directly addresses a failure mode that matters for anyone running beam search or n-best lists on long speech inputs, and the MBR fix is a straightforward, reproducible adjustment.

The data-augmentation pipeline itself follows routine short-to-long techniques already used in ASR and translation. What is missing is any check that the synthetic distribution matches real long-form test data. No length histograms, acoustic coherence measures, or semantic divergence stats are described, so it remains unclear whether models trained on the augmented set actually generalize or simply benefit from volume.

The paper is aimed at other IWSLT participants and groups building multilingual speech instruction systems. Readers looking for decoding heuristics may find the re-ranking note useful; those seeking validated methods for long-context speech will not.

It is worth sending to peer review for the shared-task proceedings because the decoding point is concrete and the scale of the data effort is documented, even though the central generalization claim rests on an unverified assumption.

Referee Report

2 major / 1 minor

Summary. The manuscript describes KIT's unconstrained submission to the IWSLT 2026 Long and Short Instruction Following tracks. It presents a data augmentation pipeline that converts short-form corpora into over 1M long-form training instances across six tasks and four languages via segment concatenation, LLM-based label generation, and cross-lingual translation. The authors identify that likelihood-based re-ranking, while effective for ASR, degrades semantic tasks by favoring candidates from segmented audio processing over holistic long-form inference, and propose resolving this by combining likelihood scores with Minimum Bayes Risk decoding.

Significance. If the empirical results hold, the work provides a practical, scalable method for creating long-form speech instruction data and isolates a concrete failure mode in likelihood re-ranking for semantic tasks together with a mitigation strategy. This could inform system design for multilingual long-form speech processing. The paper is credited for its focus on an unknown surprise task and for explicitly documenting the re-ranking artifact rather than treating it as a black-box hyperparameter.

major comments (2)

[Data Augmentation Pipeline] Data Augmentation section: The claim that the >1M synthetic examples are distributionally close enough to genuine long-form inputs for models to generalize rests on an untested assumption. No quantitative comparisons (length histograms, instruction complexity metrics, acoustic coherence, or semantic divergence scores) are supplied between the concatenated/translated set and the IWSLT long-form test distribution, which is load-bearing for both baseline performance and the reported re-ranking/MBR contrast.
[Re-ranking Analysis] Re-ranking and MBR section: The central claim that likelihood re-ranking systematically degrades semantic tasks (while MBR combination resolves it) is stated without any supporting numbers, ablation tables, or error analysis. This absence prevents verification of the failure mode and the proposed fix.

minor comments (1)

[Abstract] The abstract would be strengthened by including at least one key quantitative result (e.g., WER or semantic accuracy delta) so that the re-ranking observation can be assessed without reading the full text.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our IWSLT 2026 submission. We address the two major comments point by point below and will revise the manuscript to include the requested analyses.

Referee: [Data Augmentation Pipeline] Data Augmentation section: The claim that the >1M synthetic examples are distributionally close enough to genuine long-form inputs for models to generalize rests on an untested assumption. No quantitative comparisons (length histograms, instruction complexity metrics, acoustic coherence, or semantic divergence scores) are supplied between the concatenated/translated set and the IWSLT long-form test distribution, which is load-bearing for both baseline performance and the reported re-ranking/MBR contrast.

Authors: We agree that quantitative comparisons would strengthen the distributional similarity claim. In the revised manuscript we will add length histograms, average instruction lengths, and semantic similarity scores (via multilingual sentence embeddings) between the augmented training set and the IWSLT long-form test distribution. revision: yes

Referee: [Re-ranking Analysis] Re-ranking and MBR section: The central claim that likelihood re-ranking systematically degrades semantic tasks (while MBR combination resolves it) is stated without any supporting numbers, ablation tables, or error analysis. This absence prevents verification of the failure mode and the proposed fix.

Authors: We acknowledge the absence of explicit numbers and ablations. The revised version will include a performance table for semantic tasks comparing likelihood re-ranking alone versus likelihood combined with MBR, plus a short error analysis with illustrative examples of the failure mode. revision: yes
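The length-histogram comparison mentioned in the first response could look roughly like the sketch below, which bins durations from two sets and reports their Jensen-Shannon divergence. The bin edges and the choice of Jensen-Shannon divergence are assumptions made for illustration; the rebuttal names histograms but no specific divergence measure.

```python
import math
from typing import List, Sequence


def histogram(values: Sequence[float], edges: Sequence[float]) -> List[float]:
    """Normalized histogram; values outside the edge range fall into the end bins."""
    if not values:
        raise ValueError("need at least one value")
    counts = [0] * (len(edges) - 1)
    for v in values:
        idx = 0
        # advance while the value lies at or beyond the next bin's lower edge
        while idx < len(counts) - 1 and v >= edges[idx + 1]:
            idx += 1
        counts[idx] += 1
    total = sum(counts)
    return [c / total for c in counts]


def js_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """Jensen-Shannon divergence in bits: 0 for identical, 1 for disjoint distributions."""
    def kl(a: Sequence[float], b: Sequence[float]) -> float:
        return sum(x * math.log2(x / y) for x, y in zip(a, b) if x > 0)

    m = [(x + y) / 2 for x, y in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)
```

For example, calling `histogram` on the durations of the augmented set and of real long-form recordings with the same `edges`, then passing both results to `js_divergence`, gives a single number that can be reported next to the histograms.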

Circularity Check

0 steps flagged

Empirical system description with no derivation chain or self-referential predictions

The paper is a competition submission describing a data augmentation pipeline (segment concatenation, LLM label generation, cross-lingual translation) and a decoding combination (likelihood + MBR). No equations, fitted parameters presented as predictions, or load-bearing self-citations appear. Claims rest on empirical IWSLT results rather than any reduction to inputs by construction. This is the most common honest finding for system papers.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The abstract mentions no free parameters, mathematical axioms, or newly postulated entities; the work relies on standard machine-learning components whose details are not supplied.

pith-pipeline@v0.9.1-grok · 5725 in / 1294 out tokens · 31996 ms · 2026-06-28T06:19:44.994607+00:00 · methodology

Reference graph

Works this paper leans on

30 extracted references · 14 canonical work pages · 5 internal anchors

[1] Phi-4 technical report. arXiv preprint arXiv:2412.08905. David Ifeoluwa Adelani, Antonios Anastasopoulos, Victor Agostinelli, Luisa Bentivogli, Ondřej Bojar, Sebastien Bratières, Marine Carpuat, Roldano Cattoni, Mauro Cettolo, Lizhong Chen, Marcello Federico, Marco Gaido, Mahendra Gupta, HyoJung Han, Ali Hatami, David Javorský, Yejin Jeon, Marek Kas...

[2] Speech translation and metrics in 2026: Findings of the IWSLT campaign. In Proceedings of the 23rd International Conference on Spoken Language Translation (IWSLT 2026), San Diego, California, US. Association for Computational Linguistics. Duarte M. Alves, José Pombal, Nuno M. Guerreiro, Pedro H. Martins, João Alves, Amin Farajian, Ben Peters, Ricardo...

[3] Tower: An open multilingual large language model for translation-related tasks. Preprint, arXiv:2402.17733. Loïc Barrault, Yu-An Chung, Mariano Cora Meglioli, David Dale, Ning Dong, Paul-Ambroise Duquenne, Hady Elsahar, Hongyu Gong, Kevin Heffernan, John Hoffman, and 1 others

[4] Seamlessm4t: Massively multilingual & multimodal machine translation. arXiv preprint arXiv:2308.11596. Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova

[5] Bert: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 conference of the North American chapter of the association for computational linguistics: human language technologies, volume 1 (long and short papers), pages 4171–4186. Mara Finkelstein, Isaac Caswell, Tobias Domhan, Jan-Thorsten Peter, Juraj...

[6] Translategemma technical report. arXiv preprint arXiv:2601.09012. Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Liang Wang, Weizhu Chen, and 1 others

[7] End-to-end evaluation for low-latency simultaneous speech translation. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 12–20. Javier Iranzo-Sánchez, Joan Albert Silvestre-Cerdà, Javier Jorge, Nahuel Roselló, Adrià Giménez, Albert Sanchis, Jorge Civera, and Alfons Juan

[8] Europarl-st: A multilingual corpus for speech translation of parliamentary debates. In ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 8229–

[9] Llama-3.1-foundationai-securityllm-base-8b technical report. arXiv preprint arXiv:2504.21039. Sai Koneru, Fabian Retkowski, Christian Huber, Lukas Hilgert, Seymanur Akti, Enes Yavuz Ugan, Alex Waibel, and Jan Niehues

[10] In Proceedings of the 22nd International Conference on Spoken Language Translation (IWSLT 2025), pages 232–244. Oleksii Kuchaiev, Jason Li, Huyen Nguyen, Oleksii Hrinchuk, Ryan Leary, Boris Ginsburg, Samuel Kriman, Stanislav Beliaev, Vitaly Lavrukhin, Jack Cook, and 1 others

[11] Nemo: a toolkit for building ai applications using neural modules. arXiv preprint arXiv:1909.09577. Shankar Kumar and William Byrne

[12] Minimum Bayes-risk decoding for statistical machine translation. In Proceedings of the Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics: HLT-NAACL 2004, pages 169–176, Boston, Massachusetts, USA. Association for Computational Linguistics. Lujun Li, Lama Sleem, Geoffrey Nichil, Radu ...

[13] Efficient weight factorization for multilingual speech recognition. arXiv preprint arXiv:2105.03010. Alec Radford, Jong Wook Kim, Tao Xu, Greg Brockman, Christine McLeavey, and Ilya Sutskever

[14] Robust speech recognition via large-scale weak supervision. Preprint, arXiv:2212.04356. Ricardo Rei, Marcos Treviso, Nuno M Guerreiro, Chrysoula Zerva, Ana C Farinha, Christine Maroti, José GC De Souza, Taisiya Glushkova, Duarte Alves, Luisa Coheur, and 1 others

[15] Cometkiwi: Ist-unbabel 2022 submission for the quality estimation shared task. In Proceedings of the Seventh Conference on Machine Translation (WMT), pages 634–

[16] Summarizing speech: A comprehensive survey. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 27275–27306, Suzhou, China. Association for Computational Linguistics. Fabian Retkowski, Maike Züfle, Thai Binh Nguyen, Jan Niehues, and Alexander Waibel

[17] Beyond transcripts: A renewed perspective on audio chaptering. Preprint, arXiv:2602.08979. Supriti Sinhamahapatra and Jan Niehues

[18] Do slides help? multi-modal context for automatic transcription of conference talks. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 16111–16121. Changli Tang, Wenyi Yu, Guangzhi Sun, Xianzhao Chen, Tian Tan, Wei Li, Lu Lu, Zejun Ma, and Chao Zhang

[19] SALMONN: towards generic hearing abilities for large language models. In The Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11,

[20] Gemma 3 technical report. Preprint, arXiv:2503.19786. Ioannis Tsiamas, Gerard I. Gállego, José A. R. Fonollosa, and Marta R. Costa-jussà

[21] SHAS: Approaching optimal Segmentation for End-to-End Speech Translation. In Proc. Interspeech 2022, pages 106–110. Enes Yavuz Ugan, Ngoc-Quan Pham, and Alexander Waibel

[22] Weight factorization and centralization for continual learning in speech recognition. arXiv preprint arXiv:2506.16574. Changhan Wang, Anne Wu, and Juan Pino

[23] Covost 2: A massively multilingual speech-to-text translation corpus. Preprint, arXiv:2007.10310. Dingdong Wang, Jincenzi Wu, Junan Li, Dongchao Yang, Xueyuan Chen, Tianhua Zhang, and Helen Meng

[24] Mmsu: A massive multi-task spoken language understanding and reasoning benchmark. arXiv preprint arXiv:2506.04779. Jin Xu, Zhifang Guo, Jinzheng He, Hangrui Hu, Ting He, Shuai Bai, Keqin Chen, Jialin Wang, Yang Fan, Kai Dang, Bin Zhang, Xiong Wang, Yunfei Chu, and Junyang Lin. 2025a. Qwen2.5-omni technical report. arXiv preprint arXiv:2503.20215. Jin Xu, ...

[25] Librisqa: Advancing free-form and open-ended spoken question answering with a novel dataset and framework. arXiv preprint arXiv:2308.10390. Yaowei Zheng, Richong Zhang, Junhao Zhang, Yanhan Ye, Zheyan Luo, Zhangchi Feng, and Yongqiang Ma

[26] Contrastive learning for task-independent SpeechLLM-pretraining. In Findings of the Association for Computational Linguistics: ACL 2025, pages 8469–8490, Vienna, Austria. Association for Computational Linguistics. Maike Züfle, Sara Papi, Beatrice Savoldi, Marco Gaido, Luisa Bentivogli, and Jan Niehues

[27] NUTSHELL: A dataset for abstract generation from scientific talks. In Proceedings of the 22nd International Conference on Spoken Language Translation (IWSLT 2025), pages 19–32, Vienna, Austria (in-person and online). Association for Computational Linguistics. A Appendix B Data Prompts & Instructions We employ an LLM to generate paraphrased variants of m...

[28] for text-based comparison. An overview of the re-ranking strategies and the number of model calls required for each is given in Tab. A2. Columns: Method, Description, Model calls. Likelihood: Each candidate is scored independently by computing log p(candidate | audio) with an external model. The candidate with the highest likelihood is selected. Model calls: N. Comparison: The model re...

[29] to evaluate our re-ranking methods using the official MCIF evaluation code. Results per language are shown in Tab. A3 (en), Tab. A4 (de), Tab. A5 (it), and Tab. A6 (zh). Re-ranking yields consistent improvements only for English, where Likelihood achieves the largest gain, driven primarily by a strong improvement on ASR. For Chinese, modest improvements a...

[30] excel on specific languages. This granular view helps identify how language-specific challenges (morphological complexity, data scarcity, and pre-training bias) affect different tasks and languages. EN ZH DE IT ID Method SQA SSUM ASR SQA SSUM ST SQA SSUM ST SQA SSUM ST Baselines (1) Omni 24.93 19.34 53.40 43.03 9.59 75.88 28.34 11.55 61.16 26.82 16.35 68.92 ...
