arxiv: 2606.04730 · v1 · pith:YYWTWFFDnew · submitted 2026-06-03 · 💻 cs.CL · eess.AS

Multilingual Long-Form Speech Instruction Following: KIT's Submission to IWSLT 2026

Pith reviewed 2026-06-28 06:19 UTC · model grok-4.3

classification 💻 cs.CL eess.AS
keywords long-form speech · instruction following · data augmentation · minimum Bayes risk · likelihood re-ranking · multilingual ASR · semantic tasks · IWSLT

The pith

Likelihood re-ranking degrades semantic performance in long-form speech by favoring segmented-audio candidates, but combining it with Minimum Bayes Risk decoding corrects the failure.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper describes a data augmentation pipeline that turns short-form speech corpora into over one million long-form training examples by concatenating segments, generating labels with an LLM, and translating across languages to cover six tasks in four languages. It then demonstrates that pure likelihood-based re-ranking, which works well for ASR, harms semantic instruction-following tasks because it prefers candidates produced from short audio segments rather than from full long-form processing. Combining the likelihood scores with Minimum Bayes Risk decoding avoids this selection bias and restores performance on the semantic tasks. The work is presented as KIT's entry in the unconstrained track of the IWSLT 2026 Instruction Following evaluation, which includes an unknown surprise task.

Core claim

A general data augmentation pipeline converts short-form corpora into long-form training data through segment concatenation, LLM-based label generation, and cross-lingual translation, yielding over 1M instances across six tasks and four languages. Likelihood-based re-ranking, while highly effective for ASR, systematically degrades semantic tasks by spuriously selecting candidates generated from segmented audio processing rather than holistic long-form inference; this failure mode is resolved by combining likelihood with Minimum Bayes Risk decoding.
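The segment-concatenation step can be pictured with a minimal sketch. The abstract does not describe how segments are grouped, so the greedy packing rule, the `Segment` and `LongFormExample` types, and the `target_duration_s` parameter are all illustrative assumptions, not the authors' implementation. LLM-based label generation and translation are omitted.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class Segment:
    audio_id: str      # hypothetical identifier of a short-form clip
    duration_s: float  # clip length in seconds
    transcript: str    # reference text of the clip


@dataclass
class LongFormExample:
    segment_ids: List[str]
    duration_s: float
    transcript: str


def concatenate_segments(segments: Iterable[Segment],
                         target_duration_s: float) -> List[LongFormExample]:
    """Greedily pack consecutive segments until the target duration is reached.

    Assumption: segments arrive in their natural order (e.g. same talk).
    """
    examples: List[LongFormExample] = []
    ids: List[str] = []
    texts: List[str] = []
    total = 0.0
    for seg in segments:
        ids.append(seg.audio_id)
        texts.append(seg.transcript)
        total += seg.duration_s
        if total >= target_duration_s:
            examples.append(LongFormExample(ids, total, " ".join(texts)))
            ids, texts, total = [], [], 0.0
    if ids:  # keep the shorter remainder as a final example
        examples.append(LongFormExample(ids, total, " ".join(texts)))
    return examples


# Toy input: five 10-second clips packed into roughly 30-second examples.
clips = [Segment(f"clip{i}", 10.0, f"sentence {i}.") for i in range(5)]
long_examples = concatenate_segments(clips, target_duration_s=30.0)
```

In a real pipeline the concatenated transcript would then serve as input for label generation on the semantic tasks; how the authors do this is not stated in the abstract.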

What carries the argument

Likelihood re-ranking combined with Minimum Bayes Risk decoding, which prevents selection of segmented-audio candidates on semantic tasks.
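A minimal sketch of how such a combination might look. The abstract does not state how the two scores are merged, so the linear interpolation, the min-max normalization, the `weight` parameter, and the token-F1 utility are assumptions chosen for illustration; a real system would likely use a learned or task-specific utility metric.

```python
from collections import Counter
from typing import List, Sequence


def token_f1(hyp: str, ref: str) -> float:
    """Bag-of-words F1 between two strings (stand-in utility metric)."""
    h, r = Counter(hyp.split()), Counter(ref.split())
    overlap = sum((h & r).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(h.values())
    recall = overlap / sum(r.values())
    return 2 * precision * recall / (precision + recall)


def mbr_scores(candidates: Sequence[str]) -> List[float]:
    """Expected utility of each candidate, using the other candidates as pseudo-references."""
    if len(candidates) < 2:
        return [0.0 for _ in candidates]
    scores = []
    for i, cand in enumerate(candidates):
        others = [c for j, c in enumerate(candidates) if j != i]
        scores.append(sum(token_f1(cand, o) for o in others) / len(others))
    return scores


def min_max(values: Sequence[float]) -> List[float]:
    """Rescale scores to [0, 1] so the two signals are comparable."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def select(candidates: Sequence[str],
           log_likelihoods: Sequence[float],
           weight: float = 0.5) -> int:
    """Index of the best candidate; weight=1.0 is pure likelihood re-ranking."""
    lik = min_max(log_likelihoods)
    mbr = min_max(mbr_scores(candidates))
    combined = [weight * l + (1 - weight) * m for l, m in zip(lik, mbr)]
    return max(range(len(candidates)), key=combined.__getitem__)


# Toy pool: two similar long-form candidates and one short outlier
# that happens to receive the highest log-likelihood.
cands = [
    "the talk covers speech translation",
    "the talk covers speech translation methods",
    "yes",
]
logp = [-12.0, -14.0, -5.0]
pure_choice = select(cands, logp, weight=1.0)
combined_choice = select(cands, logp, weight=0.5)
```

In this toy pool, pure likelihood picks the short outlier, while the interpolated score picks a candidate that agrees with the rest of the pool, which is the intuition behind using consensus to counter a likelihood bias.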

If this is right

  • The same augmentation pipeline can be applied to additional languages and tasks without task-specific redesign.
  • Minimum Bayes Risk decoding becomes necessary whenever re-ranking is used on semantic rather than purely acoustic objectives.
  • Systems can be prepared for unknown surprise tasks by training only on the generated long-form mixture rather than on known task distributions.
  • The observed failure of likelihood re-ranking is specific to long-form inputs and does not appear in short-form ASR settings.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same selection bias may appear in any pipeline that mixes short-segment and full-context decoding paths, suggesting a general need for risk-based decoding in long-context speech systems.
  • If the augmented data distribution proves sufficiently close to real long-form speech, the approach could reduce reliance on scarce naturally occurring long recordings for training.
  • The result implies that future instruction-following benchmarks should include explicit long-form audio to expose re-ranking artifacts that short-form tests miss.

Load-bearing premise

Converting short-form corpora into long-form training data via segment concatenation, LLM-based label generation, and cross-lingual translation produces examples whose distribution is close enough to real long-form inputs that models trained on them will generalize to genuine long-form instruction following.

What would settle it

A controlled test on held-out real long-form audio showing that pure likelihood re-ranking selects segmented candidates at the same or lower rate than holistic candidates on semantic tasks, or that the augmented training data yields no gain over short-form-only training on genuine long-form inputs.
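The selection-rate part of such a test amounts to counting, per task, how often the re-ranker's pick came from the segmented path. The sketch below assumes each pick is logged as a `(task, source)` pair with `source` being either `"segmented"` or `"holistic"`; the record format, the task names, and the toy data are hypothetical and are not results from the paper.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def segmented_selection_rate(
    picks: Iterable[Tuple[str, str]]
) -> Dict[str, float]:
    """Fraction of selected candidates per task that came from segmented audio."""
    counts = defaultdict(lambda: [0, 0])  # task -> [segmented picks, total picks]
    for task, source in picks:
        counts[task][1] += 1
        if source == "segmented":
            counts[task][0] += 1
    return {task: seg / total for task, (seg, total) in counts.items()}


# Illustrative log of re-ranker decisions (made-up data).
picks = [
    ("asr", "segmented"), ("asr", "holistic"),
    ("semantic", "segmented"), ("semantic", "segmented"),
    ("semantic", "holistic"), ("semantic", "segmented"),
]
rates = segmented_selection_rate(picks)
```

Comparing each rate with the share of segmented candidates in the pool would show whether the re-ranker prefers them more often than chance.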

Original abstract

With the advent of Large Language Models, single-task and token-based multi-task models have evolved into instruction-based systems that infer task and target language implicitly from natural language prompts. This trend is reflected in IWSLT's Instruction Following Track, which this year introduced new tasks including an unknown surprise task, posing a genuine challenge against overfitting to known tasks. We present KIT's submission to the Long and Short Instruction Following tracks in the unconstrained setting. Our approach combines a general data augmentation pipeline that converts short-form corpora into long-form training data through segment concatenation, LLM-based label generation, and cross-lingual translation, yielding over 1M instances across six tasks and four languages. We further show that likelihood-based re-ranking, while highly effective for ASR, systematically degrades semantic tasks by spuriously selecting candidates generated from segmented audio processing rather than holistic long-form inference, a failure mode resolved by combining likelihood with Minimum Bayes Risk decoding.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript describes KIT's unconstrained submission to the IWSLT 2026 Long and Short Instruction Following tracks. It presents a data augmentation pipeline that converts short-form corpora into over 1M long-form training instances across six tasks and four languages via segment concatenation, LLM-based label generation, and cross-lingual translation. The authors identify that likelihood-based re-ranking, while effective for ASR, degrades semantic tasks by favoring candidates from segmented audio processing over holistic long-form inference, and propose resolving this by combining likelihood scores with Minimum Bayes Risk decoding.

Significance. If the empirical results hold, the work provides a practical, scalable method for creating long-form speech instruction data and isolates a concrete failure mode in likelihood re-ranking for semantic tasks together with a mitigation strategy. This could inform system design for multilingual long-form speech processing. The paper deserves credit for addressing the unknown surprise task and for explicitly documenting the re-ranking artifact rather than treating it as a black-box hyperparameter.

major comments (2)
  1. [Data Augmentation Pipeline] Data Augmentation section: The claim that the >1M synthetic examples are distributionally close enough to genuine long-form inputs for models to generalize rests on an untested assumption. No quantitative comparisons (length histograms, instruction complexity metrics, acoustic coherence, or semantic divergence scores) are supplied between the concatenated/translated set and the IWSLT long-form test distribution, which is load-bearing for both baseline performance and the reported re-ranking/MBR contrast.
  2. [Re-ranking Analysis] Re-ranking and MBR section: The central claim that likelihood re-ranking systematically degrades semantic tasks (while MBR combination resolves it) is stated without any supporting numbers, ablation tables, or error analysis. This absence prevents verification of the failure mode and the proposed fix.
minor comments (1)
  1. [Abstract] The abstract would be strengthened by including at least one key quantitative result (e.g., WER or semantic accuracy delta) so that the re-ranking observation can be assessed without reading the full text.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our IWSLT 2026 submission. We address the two major comments point by point below and will revise the manuscript to include the requested analyses.

Point-by-point responses
  1. Referee: [Data Augmentation Pipeline] Data Augmentation section: The claim that the >1M synthetic examples are distributionally close enough to genuine long-form inputs for models to generalize rests on an untested assumption. No quantitative comparisons (length histograms, instruction complexity metrics, acoustic coherence, or semantic divergence scores) are supplied between the concatenated/translated set and the IWSLT long-form test distribution, which is load-bearing for both baseline performance and the reported re-ranking/MBR contrast.

    Authors: We agree that quantitative comparisons would strengthen the distributional similarity claim. In the revised manuscript we will add length histograms, average instruction lengths, and semantic similarity scores (via multilingual sentence embeddings) between the augmented training set and the IWSLT long-form test distribution.
    Revision: yes

  2. Referee: [Re-ranking Analysis] Re-ranking and MBR section: The central claim that likelihood re-ranking systematically degrades semantic tasks (while MBR combination resolves it) is stated without any supporting numbers, ablation tables, or error analysis. This absence prevents verification of the failure mode and the proposed fix.

    Authors: We acknowledge the absence of explicit numbers and ablations. The revised version will include a performance table for semantic tasks comparing likelihood re-ranking alone versus likelihood combined with MBR, plus a short error analysis with illustrative examples of the failure mode.
    Revision: yes
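The length-histogram comparison promised in the first response could be summarized in a single number, for example the Jensen-Shannon divergence between binned duration distributions. The sketch below is one possible way to do this; the bin edges and the duration lists are made-up placeholders, and nothing here is taken from the paper.

```python
import bisect
import math
from typing import List, Sequence


def histogram(values: Sequence[float], edges: Sequence[float]) -> List[float]:
    """Normalized histogram over [edges[0], edges[-1]]; out-of-range values are ignored."""
    counts = [0] * (len(edges) - 1)
    for v in values:
        idx = bisect.bisect_right(edges, v) - 1
        if v == edges[-1]:
            idx = len(counts) - 1  # include the right-most edge in the last bin
        if 0 <= idx < len(counts):
            counts[idx] += 1
    total = sum(counts)
    if total == 0:
        raise ValueError("no values fall inside the bin range")
    return [c / total for c in counts]


def js_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """Jensen-Shannon divergence in bits: 0 for identical, 1 for disjoint distributions."""
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(x: Sequence[float], y: Sequence[float]) -> float:
        return sum(a * math.log2(a / b) for a, b in zip(x, y) if a > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


# Placeholder durations in seconds (not data from the paper).
edges = [0, 60, 300, 900]
synthetic_durations = [30, 120, 200, 400]
real_durations = [45, 150, 500, 800]

p = histogram(synthetic_durations, edges)
q = histogram(real_durations, edges)
divergence = js_divergence(p, q)
```

A value near 0 would suggest that the synthetic and real length distributions are similar under this binning; it says nothing about acoustic coherence or instruction complexity, which would need separate measures.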

Circularity Check

0 steps flagged

Empirical system description with no derivation chain or self-referential predictions

full rationale

The paper is a competition submission describing a data augmentation pipeline (segment concatenation, LLM label generation, cross-lingual translation) and a decoding combination (likelihood + MBR). No equations, fitted parameters presented as predictions, or load-bearing self-citations appear. Claims rest on empirical IWSLT results rather than any reduction to inputs by construction. This is the most common honest finding for system papers.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The abstract mentions no free parameters, mathematical axioms, or newly postulated entities; the work relies on standard machine-learning components whose details are not supplied.
