Multilingual Long-Form Speech Instruction Following: KIT's Submission to IWSLT 2026
Pith reviewed 2026-06-28 06:19 UTC · model grok-4.3
The pith
Likelihood re-ranking degrades semantic performance in long-form speech by favoring segmented-audio candidates, but Minimum Bayes Risk decoding corrects the failure.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
A general data augmentation pipeline converts short-form corpora into long-form training data through segment concatenation, LLM-based label generation, and cross-lingual translation, yielding over 1M instances across six tasks and four languages. Likelihood-based re-ranking, while highly effective for ASR, systematically degrades semantic tasks by spuriously selecting candidates generated from segmented audio processing rather than holistic long-form inference; this failure mode is resolved by combining likelihood with Minimum Bayes Risk decoding.
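The segment-concatenation step of the pipeline can be illustrated with a minimal sketch. It assumes a greedy packing rule and a duration cap; neither is confirmed by the source. The names `Segment`, `LongFormInstance`, and `concatenate_segments` are hypothetical, and the LLM-based label generation and cross-lingual translation steps are not shown.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Segment:
    audio_id: str      # hypothetical identifier of a short audio clip
    duration_s: float  # clip duration in seconds
    transcript: str    # reference transcript of the clip


@dataclass
class LongFormInstance:
    audio_ids: List[str]  # clips to be concatenated, in order
    duration_s: float     # total duration of the concatenated audio
    transcript: str       # concatenated transcript


def _merge(segs: List[Segment]) -> LongFormInstance:
    """Join consecutive segments into one long-form instance."""
    return LongFormInstance(
        audio_ids=[s.audio_id for s in segs],
        duration_s=sum(s.duration_s for s in segs),
        transcript=" ".join(s.transcript for s in segs),
    )


def concatenate_segments(segments: List[Segment],
                         max_duration_s: float) -> List[LongFormInstance]:
    """Greedily pack consecutive segments up to max_duration_s (assumed rule).

    A single segment longer than the cap still becomes its own instance.
    """
    instances: List[LongFormInstance] = []
    current: List[Segment] = []
    total = 0.0
    for seg in segments:
        # Flush the running group before it would exceed the cap.
        if current and total + seg.duration_s > max_duration_s:
            instances.append(_merge(current))
            current, total = [], 0.0
        current.append(seg)
        total += seg.duration_s
    if current:
        instances.append(_merge(current))
    return instances
```

In this sketch, segment order is preserved so that the concatenated transcript stays aligned with the concatenated audio.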
What carries the argument
Likelihood re-ranking combined with Minimum Bayes Risk decoding, which prevents selection of segmented-audio candidates on semantic tasks.
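One way to picture this combination is the sketch below. It is not the authors' implementation: the token-overlap F1 utility, the min-max normalization, and the interpolation `weight` are all assumptions made for illustration, and every function name is hypothetical.

```python
from collections import Counter
from typing import List, Sequence


def token_f1(hyp: str, ref: str) -> float:
    """Token-overlap F1, used here as a stand-in utility (assumption)."""
    h, r = hyp.split(), ref.split()
    if not h or not r:
        return float(h == r)
    overlap = sum((Counter(h) & Counter(r)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(h), overlap / len(r)
    return 2 * precision * recall / (precision + recall)


def mbr_scores(cands: Sequence[str]) -> List[float]:
    """Expected utility of each candidate against all other candidates."""
    n = len(cands)
    if n < 2:
        return [0.0] * n
    return [
        sum(token_f1(c, o) for j, o in enumerate(cands) if j != i) / (n - 1)
        for i, c in enumerate(cands)
    ]


def min_max(xs: Sequence[float]) -> List[float]:
    """Rescale scores to [0, 1] so the two signals are comparable."""
    lo, hi = min(xs), max(xs)
    if hi == lo:
        return [0.0] * len(xs)
    return [(x - lo) / (hi - lo) for x in xs]


def rerank(cands: Sequence[str], log_likelihoods: Sequence[float],
           weight: float = 0.5) -> int:
    """Return the index of the best candidate.

    weight=1.0 is pure likelihood re-ranking; weight=0.0 is pure MBR.
    The linear interpolation is an assumed combination rule.
    """
    if not cands or len(cands) != len(log_likelihoods):
        raise ValueError("need one log-likelihood per candidate")
    lik = min_max(log_likelihoods)
    mbr = min_max(mbr_scores(cands))
    combined = [weight * l + (1 - weight) * m for l, m in zip(lik, mbr)]
    return max(range(len(cands)), key=combined.__getitem__)
```

With a candidate pool in which one high-likelihood output disagrees with all the others, pure likelihood picks the outlier, while the interpolated score favors a candidate that the rest of the pool supports.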
If this is right
- The same augmentation pipeline can be applied to additional languages and tasks without task-specific redesign.
- Minimum Bayes Risk decoding becomes necessary whenever re-ranking is used on semantic rather than purely acoustic objectives.
- Systems can be prepared for unknown surprise tasks by training only on the generated long-form mixture rather than on known task distributions.
- The observed failure of likelihood re-ranking is specific to long-form inputs and does not appear in short-form ASR settings.
Where Pith is reading between the lines
- The same selection bias may appear in any pipeline that mixes short-segment and full-context decoding paths, suggesting a general need for risk-based decoding in long-context speech systems.
- If the augmented data distribution proves sufficiently close to real long-form speech, the approach could reduce reliance on scarce naturally occurring long recordings for training.
- The result implies that future instruction-following benchmarks should include explicit long-form audio to expose re-ranking artifacts that short-form tests miss.
Load-bearing premise
Converting short-form corpora into long-form training data via segment concatenation, LLM-based label generation, and cross-lingual translation produces examples whose distribution is close enough to real long-form inputs that models trained on them will generalize to genuine long-form instruction following.
What would settle it
A controlled test on held-out real long-form audio showing that pure likelihood re-ranking selects segmented candidates at the same or lower rate than holistic candidates on semantic tasks, or that the augmented training data yields no gain over short-form-only training on genuine long-form inputs.
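The first half of such a test could be scored along the lines of the sketch below. It assumes each evaluated example records the source of the selected candidate and the sources of all candidates in its pool; the labels "segmented" and "holistic" and the function name are hypothetical.

```python
from typing import Iterable, List, Tuple


def segmented_rates(
    examples: Iterable[Tuple[str, List[str]]]
) -> Tuple[float, float]:
    """Return (selection rate, mean pool share) for segmented candidates.

    Each example is (source of the chosen candidate, sources of all
    candidates in the pool). A selection rate well above the pool share
    would indicate that the re-ranker over-selects segmented candidates.
    """
    chosen_segmented = 0
    pool_share_sum = 0.0
    n = 0
    for chosen_source, pool_sources in examples:
        if not pool_sources:
            continue  # skip examples without a recorded candidate pool
        n += 1
        chosen_segmented += chosen_source == "segmented"
        pool_share_sum += pool_sources.count("segmented") / len(pool_sources)
    if n == 0:
        raise ValueError("no examples with a non-empty candidate pool")
    return chosen_segmented / n, pool_share_sum / n
```

Comparing the two returned numbers, per task and per re-ranking method, gives a direct reading of whether segmented candidates are favored beyond their share of the pool.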
Original abstract
With the advent of Large Language Models, single-task and token-based multi-task models have evolved into instruction-based systems that infer task and target language implicitly from natural language prompts. This trend is reflected in IWSLT's Instruction Following Track, which this year introduced new tasks including an unknown surprise task, posing a genuine challenge against overfitting to known tasks. We present KIT's submission to the Long and Short Instruction Following tracks in the unconstrained setting. Our approach combines a general data augmentation pipeline that converts short-form corpora into long-form training data through segment concatenation, LLM-based label generation, and cross-lingual translation, yielding over 1M instances across six tasks and four languages. We further show that likelihood-based re-ranking, while highly effective for ASR, systematically degrades semantic tasks by spuriously selecting candidates generated from segmented audio processing rather than holistic long-form inference, a failure mode resolved by combining likelihood with Minimum Bayes Risk decoding.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript describes KIT's unconstrained submission to the IWSLT 2026 Long and Short Instruction Following tracks. It presents a data augmentation pipeline that converts short-form corpora into over 1M long-form training instances across six tasks and four languages via segment concatenation, LLM-based label generation, and cross-lingual translation. The authors identify that likelihood-based re-ranking, while effective for ASR, degrades semantic tasks by favoring candidates from segmented audio processing over holistic long-form inference, and propose resolving this by combining likelihood scores with Minimum Bayes Risk decoding.
Significance. If the empirical results hold, the work provides a practical, scalable method for creating long-form speech instruction data and isolates a concrete failure mode of likelihood re-ranking on semantic tasks, together with a mitigation strategy. This could inform system design for multilingual long-form speech processing. The paper deserves credit for addressing an unknown surprise task and for explicitly documenting the re-ranking artifact rather than treating it as a black-box hyperparameter.
Major comments (2)
- [Data Augmentation Pipeline] Data Augmentation section: The claim that the >1M synthetic examples are distributionally close enough to genuine long-form inputs for models to generalize rests on an untested assumption. No quantitative comparisons (length histograms, instruction complexity metrics, acoustic coherence, or semantic divergence scores) are supplied between the concatenated/translated set and the IWSLT long-form test distribution, which is load-bearing for both baseline performance and the reported re-ranking/MBR contrast.
- [Re-ranking Analysis] Re-ranking and MBR section: The central claim that likelihood re-ranking systematically degrades semantic tasks (while MBR combination resolves it) is stated without any supporting numbers, ablation tables, or error analysis. This absence prevents verification of the failure mode and the proposed fix.
Minor comments (1)
- [Abstract] The abstract would be strengthened by including at least one key quantitative result (e.g., WER or semantic accuracy delta) so that the re-ranking observation can be assessed without reading the full text.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our IWSLT 2026 submission. We address the two major comments point by point below and will revise the manuscript to include the requested analyses.
- Referee: [Data Augmentation Pipeline] Data Augmentation section: The claim that the >1M synthetic examples are distributionally close enough to genuine long-form inputs for models to generalize rests on an untested assumption. No quantitative comparisons (length histograms, instruction complexity metrics, acoustic coherence, or semantic divergence scores) are supplied between the concatenated/translated set and the IWSLT long-form test distribution, which is load-bearing for both baseline performance and the reported re-ranking/MBR contrast.
  Authors: We agree that quantitative comparisons would strengthen the distributional similarity claim. In the revised manuscript we will add length histograms, average instruction lengths, and semantic similarity scores (via multilingual sentence embeddings) between the augmented training set and the IWSLT long-form test distribution.
  Revision: yes
- Referee: [Re-ranking Analysis] Re-ranking and MBR section: The central claim that likelihood re-ranking systematically degrades semantic tasks (while MBR combination resolves it) is stated without any supporting numbers, ablation tables, or error analysis. This absence prevents verification of the failure mode and the proposed fix.
  Authors: We acknowledge the absence of explicit numbers and ablations. The revised version will include a performance table for semantic tasks comparing likelihood re-ranking alone versus likelihood combined with MBR, plus a short error analysis with illustrative examples of the failure mode.
  Revision: yes
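The length-histogram comparison promised in the first response could look roughly like the sketch below. The bin edges, the use of total variation distance, and the function names are assumptions for illustration; the rebuttal does not specify how the histograms would be compared.

```python
from typing import List, Sequence


def histogram(values: Sequence[float], bin_edges: Sequence[float]) -> List[float]:
    """Normalized histogram over [edge_i, edge_{i+1}) bins.

    The last bin also includes its right edge; values outside the
    covered range are ignored.
    """
    counts = [0] * (len(bin_edges) - 1)
    for v in values:
        for i in range(len(counts)):
            is_last = i == len(counts) - 1
            in_bin = bin_edges[i] <= v < bin_edges[i + 1]
            if in_bin or (is_last and v == bin_edges[-1]):
                counts[i] += 1
                break
    total = sum(counts)
    if total == 0:
        raise ValueError("no values fall inside the bin range")
    return [c / total for c in counts]


def total_variation(p: Sequence[float], q: Sequence[float]) -> float:
    """Total variation distance between two normalized histograms."""
    if len(p) != len(q):
        raise ValueError("histograms must share the same bins")
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))
```

A distance near 0 would mean the augmented and real duration distributions occupy the same bins in similar proportions; a distance near 1 would mean they barely overlap.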
Circularity Check
Empirical system description with no derivation chain or self-referential predictions
The paper is a competition submission describing a data augmentation pipeline (segment concatenation, LLM label generation, cross-lingual translation) and a decoding combination (likelihood + MBR). No equations, fitted parameters presented as predictions, or load-bearing self-citations appear. Claims rest on empirical IWSLT results rather than any reduction to inputs by construction. This is the most common honest finding for system papers.
Appendix excerpts
- Section B, Data Prompts & Instructions: We employ an LLM to generate paraphrased variants of m...
- Re-ranking strategies: ... for text-based comparison. An overview of the re-ranking strategies and the number of model calls required for each is given in Tab. A2.
  Method | Description | Model calls
  Likelihood | Each candidate is scored independently by computing log p(candidate | audio) with an external model. The candidate with the highest likelihood is selected. | N
  Comparison | The model re... |
- Re-ranking results: ... to evaluate our re-ranking methods using the official MCIF evaluation code. Results per language are shown in Tab. A3 (en), Tab. A4 (de), Tab. A5 (it), and Tab. A6 (zh). Re-ranking yields consistent improvements only for English, where Likelihood achieves the largest gain, driven primarily by a strong improvement on ASR. For Chinese, modest improvements a...
- Per-language results: ... excel on specific languages. This granular view helps identify how language-specific challenges (morphological complexity, data scarcity, and pre-training bias) affect different tasks and languages.
  ID | Method | EN SQA | EN SSUM | EN ASR | ZH SQA | ZH SSUM | ZH ST | DE SQA | DE SSUM | DE ST | IT SQA | IT SSUM | IT ST
  Baselines
  (1) | Omni | 24.93 | 19.34 | 53.40 | 43.03 | 9.59 | 75.88 | 28.34 | 11.55 | 61.16 | 26.82 | 16.35 | 68.92
  ...