Do What I Say: A Spoken Prompt Dataset for Instruction-Following

Alexander Waibel; Fabian Retkowski; Jan Niehues; Luisa Bentivogli; Maike Züfle; Marek Kasztelnik; Sara Papi; Szymon Mazurek

arxiv: 2603.09881 · v2 · submitted 2026-03-10 · 💻 cs.CL

Pith reviewed 2026-05-15 13:13 UTC · model grok-4.3

keywords: spoken prompts, speech large language models, instruction following, multilingual dataset, prompt modality, SLLM evaluation, benchmarking

The pith

Text prompts outperform spoken prompts when testing speech language models, except on tasks that require speech output.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces DoWhatISay (DOWIS), a dataset of human-recorded spoken prompts paired with text versions for nine tasks across eleven languages. It benchmarks current speech large language models and shows that text prompts work better overall, with the gap widening for low-resource languages and cross-lingual use. Spoken prompts close the gap only when the model must generate speech as its answer. This matters because many real users speak to these models rather than type, so text-only tests may overestimate how well the models handle everyday spoken instructions.

Core claim

We introduce DoWhatISay (DOWIS), a multilingual dataset of human-recorded spoken and written prompts designed to pair with any existing benchmark for realistic evaluation of SLLMs under spoken instruction conditions. Spanning 9 tasks and 11 languages, it provides 10 prompt variants per task-language pair across five styles. Benchmarking state-of-the-art SLLMs shows that text prompts consistently outperform spoken prompts, particularly for low-resource and cross-lingual settings. Only for tasks with speech output do spoken prompts close the gap.

What carries the argument

The DOWIS dataset, which supplies paired human-recorded spoken and text prompts for the same instructions to enable direct comparison of prompt modality effects on SLLM performance.
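As a rough illustration of the pairing described above, the sketch below models one spoken/written prompt pair as a record and enumerates the 9 x 11 x 10 grid. Everything concrete in it is an assumption made for illustration: the `PromptPair` class, the placeholder task and language names, the audio path layout, and the even split of two variants per style are not taken from the paper. Only the counts and the five style names (which follow the Figure 2 column headers) come from this page.

```python
from dataclasses import dataclass
from itertools import product

# Style names follow the column headers visible in the Figure 2 table excerpt.
STYLES = ["basic", "formal", "informal", "detailed", "short"]
# Assumption: the 10 variants per task-language pair split evenly over 5 styles.
VARIANTS_PER_STYLE = 2


@dataclass(frozen=True)
class PromptPair:
    """One instruction in both modalities (hypothetical record layout)."""
    task: str
    language: str
    style: str
    variant: int
    text: str        # written instruction
    audio_path: str  # hypothetical path to the recording of the same instruction


def build_grid(tasks, languages):
    """Create one placeholder PromptPair per task-language-style-variant cell."""
    pairs = []
    for task, lang, style, v in product(tasks, languages, STYLES, range(VARIANTS_PER_STYLE)):
        pairs.append(
            PromptPair(
                task=task,
                language=lang,
                style=style,
                variant=v,
                text=f"<{style} instruction {v} for {task} in {lang}>",
                audio_path=f"{task}/{lang}/{style}_{v}.wav",
            )
        )
    return pairs


# Placeholder names: the full task and language lists are not given on this page.
tasks = [f"task{i}" for i in range(9)]
languages = [f"lang{i}" for i in range(11)]
grid = build_grid(tasks, languages)
print(len(grid))  # 9 * 11 * 10 = 990
```

Because each record carries both the text and the audio reference, the same instruction can be fed to a model in either modality while the task input stays fixed, which is the comparison the dataset is built for.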

If this is right

SLLM benchmarks should include spoken prompt conditions to avoid overestimating model capabilities in realistic use.
Low-resource language performance drops more sharply with spoken input, indicating a priority area for model improvement.
Speech-output tasks benefit from modality-matched prompting, suggesting targeted training on aligned input-output speech.
Cross-lingual settings amplify the spoken-prompt penalty, pointing to weaker transfer when instructions arrive as audio.
Prompt style variations interact with modality, so evaluation sets need multiple styles to capture full robustness.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Model developers could add explicit speech-input training objectives to narrow the gap observed in most tasks.
The dataset enables future work on prompt engineering techniques that compensate for spoken input weaknesses.
Real deployments might route spoken instructions through a text transcription step for non-speech tasks to retain higher accuracy.
Extending the dataset with more spontaneous, noisy recordings would test whether the current gap widens or narrows under messier conditions.
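The transcription-routing idea in the list above can be sketched as a small dispatch function. This is a minimal sketch under stated assumptions: `prepare_instruction` and `fake_transcribe` are hypothetical names, the transcriber is a stand-in rather than a real ASR call, and the speech-output task set contains only TTS and S2ST because those are the two the paper excerpts on this page name as speech-output tasks.

```python
from typing import Callable, Tuple, Union

# Tasks named on this page as producing speech output.
SPEECH_OUTPUT_TASKS = {"TTS", "S2ST"}


def prepare_instruction(
    task: str,
    audio_prompt: bytes,
    transcribe: Callable[[bytes], str],
) -> Tuple[str, Union[str, bytes]]:
    """Return (modality, prompt) to send to the model.

    Speech-output tasks keep the spoken prompt; every other task gets a
    transcript, on the assumption that text prompts score higher there.
    """
    if task in SPEECH_OUTPUT_TASKS:
        return ("audio", audio_prompt)
    return ("text", transcribe(audio_prompt))


def fake_transcribe(audio: bytes) -> str:
    """Stand-in for an ASR system: pretends the bytes are already text."""
    return audio.decode("utf-8")


print(prepare_instruction("ASR", b"Transcribe this recording", fake_transcribe))
print(prepare_instruction("TTS", b"Read this aloud", fake_transcribe))
```

Whether this routing actually helps depends on the transcription error it introduces, which the page does not quantify.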

Load-bearing premise

The recorded spoken prompts and selected task-language-style combinations reflect the speech patterns and variations that real users would produce when interacting with these models.

What would settle it

A study in which actual users give spoken instructions to SLLMs on non-speech tasks and achieve performance equal to or better than text prompts would show the outperformance claim does not hold.

Figures

Figures reproduced from arXiv: 2603.09881 by Alexander Waibel, Fabian Retkowski, Jan Niehues, Luisa Bentivogli, Maike Züfle, Marek Kasztelnik, Sara Papi, Szymon Mazurek.

**Figure 1.** Performance comparison for Qwen: Text Prompt vs Speech Prompts with respect to different target languages. Positive values (purple) indicate text prompt performs better, negative values indicate speech prompts perform better.

Table text captured with this caption (truncated in the source). Columns: Task, Metric, Model, Text Prompt, Speech Prompt, Speech Prompt (Male), Speech Prompt (Fem.).
ASR, WER ↓, Phi*: 16.69, 332.41, 402.77, 271.15
ASR, WER ↓, Qwen: 12.60, 17.08, 13.77, 14.91
SQA, BERTS. ↑, Phi: 36.49, 11.16, 11.37, 10.96
SQA, BERTS. ↑, Qwe…

**Figure 2.** Performance comparison for Qwen2.5-Omni: Text Prompt vs Speech Prompts with respect to different prompt types. Positive values (purple) indicate text prompt performs better, negative values (yellow) indicate speech prompts perform better.

Table text captured with this caption (truncated in the source). Columns: Task, Model, then Prompt Type: Basic, Formal, Inform., Detail., Short.
ASR ↓, Phi*: 187.49, 199.13, 274.56, 173.05, 251.39
MT ↑, Phi: 6…
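The sign convention in both captions (positive means the text prompt did better) needs care when a metric is lower-is-better, as WER is. The sketch below shows one way to compute such a signed gap. The helper name `text_advantage` is hypothetical, and reading the first two numbers of each row in the Figure 1 table text as the text-prompt and speech-prompt scores is an assumption about the flattened column order, not something the page confirms.

```python
def text_advantage(text_score: float, speech_score: float, higher_is_better: bool) -> float:
    """Signed gap that is positive when the text prompt performs better.

    For higher-is-better metrics (e.g. BERTScore) this is text - speech;
    for lower-is-better metrics (e.g. WER) the sign is flipped.
    """
    diff = text_score - speech_score
    return diff if higher_is_better else -diff


# Values copied from the Figure 1 table text; the column reading is assumed.
qwen_asr_gap = text_advantage(12.60, 17.08, higher_is_better=False)  # WER
phi_sqa_gap = text_advantage(36.49, 11.16, higher_is_better=True)    # BERTScore

print(round(qwen_asr_gap, 2), round(phi_sqa_gap, 2))
```

Under that reading both gaps come out positive, which matches the page's summary that text prompts score better on text-output tasks.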

Original abstract

Speech Large Language Models (SLLMs) have rapidly expanded, supporting a wide range of tasks. These models are typically evaluated using text prompts, which may not reflect real-world scenarios where users interact with speech. To address this gap, we introduce DoWhatISay (DOWIS), a multilingual dataset of human-recorded spoken and written prompts designed to pair with any existing benchmark for realistic evaluation of SLLMs under spoken instruction conditions. Spanning 9 tasks and 11 languages, it provides 10 prompt variants per task-language pair, across five styles. Using DOWIS, we benchmark state-of-the-art SLLMs, analyzing the interplay between prompt modality, style, language, and task type. Results show that text prompts consistently outperform spoken prompts, particularly for low-resource and cross-lingual settings. Only for tasks with speech output, spoken prompts do close the gap, highlighting the need for speech-based prompting in SLLM evaluation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper's real value is the new DOWIS dataset for spoken-prompt evaluation of SLLMs, paired with benchmarks across languages and styles, though the modality-gap claims rest on controlled recordings that may not match messy real use.

The main thing to know is that this paper ships a new multilingual dataset called DoWhatISay that supplies spoken and text prompt variants for existing benchmarks. Their runs on current SLLMs show text prompts beating spoken ones in most cases, especially low-resource and cross-lingual settings, with spoken prompts only closing the gap on tasks that require speech output. That pattern is the concrete finding worth checking.

The dataset itself covers 9 tasks, 11 languages, 10 variants per task-language pair, and five styles, which is a straightforward way to move evaluation closer to how people actually talk to these models. That pairing with existing benchmarks is practical and fills a clear hole in the current literature. The benchmarking section gives initial numbers on how modality interacts with language and task type, which is useful even if the absolute scores will shift as models improve.

The soft spot is the recording protocol. The prompts are human-recorded in controlled styles, so they lack the disfluencies, prosody variation, accents, and background conditions that show up in real user speech. If those factors change the results, the reported text-over-spoken advantage could be an artifact of the clean setup rather than a general property of the models. The paper would be tighter with some validation against in-the-wild recordings or at least more detail on speaker diversity and environment.

This is aimed at groups working on speech LLMs who need better evaluation resources. Anyone running SLLM experiments or building spoken interfaces will get immediate use from the dataset itself. The work shows clear thinking about the evaluation mismatch and honest empirical runs, so it deserves a serious referee even if the representativeness question needs more attention in revision.

Referee Report

3 major / 2 minor

Summary. The paper introduces DoWhatISay (DOWIS), a multilingual dataset of human-recorded spoken and written prompts designed to pair with existing benchmarks for evaluating Speech Large Language Models (SLLMs). Spanning 9 tasks and 11 languages, with 10 prompt variants per task-language pair across five styles, the work benchmarks state-of-the-art SLLMs and reports that text prompts consistently outperform spoken prompts (especially in low-resource and cross-lingual settings), with spoken prompts closing the performance gap only for tasks involving speech output.

Significance. If the dataset proves representative, this provides a useful new resource for realistic SLLM evaluation and demonstrates concrete modality effects that could guide future model development and prompting strategies in multilingual and speech-centric scenarios. The empirical nature of the comparisons (new runs on existing models) adds direct evidence without circularity.

major comments (3)

[Dataset construction] Dataset construction section: the paper provides no details on speaker pool size, demographics, accent diversity, recording environment (e.g., noise levels, equipment), quality assurance protocols, or any validation against in-the-wild spoken interactions. This directly undermines the central claim that observed text-vs-spoken gaps reflect intrinsic SLLM properties rather than artifacts of the controlled recording protocol.
[Results and analysis] Results and analysis section: exact evaluation metrics (e.g., accuracy, F1, or task-specific scores) are not specified, nor are statistical significance tests or confidence intervals reported for the modality differences. Without these, the strength of the finding that text outperforms spoken (particularly low-resource/cross-lingual) cannot be fully assessed.
[Task and language selection] Task and language selection: it is unclear how the specific 9 tasks and 11 languages were chosen or whether they include sufficient low-resource examples to support the generalization that text prompts are superior in those regimes; this choice is load-bearing for the cross-lingual claims.

minor comments (2)

[Abstract] The abstract states 'five styles' without naming them; this should be defined in the abstract or introduction for immediate clarity.
[Figures and tables] Figure captions and table headers could more explicitly indicate which columns correspond to spoken vs. text prompt conditions to aid quick reading.

Simulated Authors' Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment point by point below and will revise the paper to provide the requested clarifications and details.

Referee: [Dataset construction] Dataset construction section: the paper provides no details on speaker pool size, demographics, accent diversity, recording environment (e.g., noise levels, equipment), quality assurance protocols, or any validation against in-the-wild spoken interactions. This directly undermines the central claim that observed text-vs-spoken gaps reflect intrinsic SLLM properties rather than artifacts of the controlled recording protocol.

Authors: We agree that additional details on dataset construction are needed. In the revised manuscript, we will expand the relevant section to include speaker pool size and demographics, accent diversity, recording environment specifications (including equipment and noise levels), quality assurance protocols, and a discussion of limitations relative to in-the-wild interactions. This will better support interpretation of the modality gaps while noting the controlled nature of the recordings for consistent comparison.

revision: yes

Referee: [Results and analysis] Results and analysis section: exact evaluation metrics (e.g., accuracy, F1, or task-specific scores) are not specified, nor are statistical significance tests or confidence intervals reported for the modality differences. Without these, the strength of the finding that text outperforms spoken (particularly low-resource/cross-lingual) cannot be fully assessed.

Authors: We will revise the Results and Analysis section to explicitly define the evaluation metrics for each task and include statistical significance tests with confidence intervals for the reported modality differences. This will allow for a more rigorous assessment of the findings.

revision: yes

Referee: [Task and language selection] Task and language selection: it is unclear how the specific 9 tasks and 11 languages were chosen or whether they include sufficient low-resource examples to support the generalization that text prompts are superior in those regimes; this choice is load-bearing for the cross-lingual claims.

Authors: We will add a subsection detailing the selection criteria for the 9 tasks and 11 languages, including justification for coverage of low-resource languages to support the cross-lingual claims. This will make the rationale transparent.

revision: yes
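The second exchange above asks for confidence intervals on the modality differences. One common way to get them, sketched here, is a paired percentile bootstrap over per-example scores. This illustrates the general technique only; it is not the procedure the authors commit to, and the function name `paired_bootstrap_ci` and the score lists are invented for the example.

```python
import random
from statistics import mean


def paired_bootstrap_ci(text_scores, speech_scores, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean of (text - speech) per-example differences.

    Both lists must score the same examples in the same order, so each
    difference isolates the effect of prompt modality.
    """
    if len(text_scores) != len(speech_scores):
        raise ValueError("score lists must be paired and of equal length")
    diffs = [t - s for t, s in zip(text_scores, speech_scores)]
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    n = len(diffs)
    # Resample the differences with replacement and record each resample's mean.
    resampled_means = sorted(mean(rng.choices(diffs, k=n)) for _ in range(n_resamples))
    lower = resampled_means[int((alpha / 2) * n_resamples)]
    upper = resampled_means[int((1 - alpha / 2) * n_resamples) - 1]
    return mean(diffs), lower, upper


# Made-up per-example scores (higher is better), purely for illustration.
text_scores = [0.80, 0.70, 0.90, 0.60, 0.75, 0.85]
speech_scores = [0.60, 0.65, 0.70, 0.50, 0.70, 0.60]
point, lower, upper = paired_bootstrap_ci(text_scores, speech_scores)
print(f"mean gap {point:.3f}, 95% CI [{lower:.3f}, {upper:.3f}]")
```

An interval that excludes zero would support a claim that the gap is not a sampling artifact. For lower-is-better metrics such as WER, the differences would need their sign flipped first.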

Circularity Check

0 steps flagged

No significant circularity in empirical dataset evaluation

The paper introduces the DOWIS dataset of spoken/written prompts and reports direct benchmarking results on existing SLLMs. All claims (text prompts outperforming spoken ones except on speech-output tasks) follow from new empirical runs without equations, fitted parameters renamed as predictions, or self-citation chains that reduce the central findings to prior inputs by construction. The evaluation is self-contained against external model benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim depends on the dataset serving as a valid proxy for real spoken instructions, resting on standard assumptions about human recording fidelity and task representativeness rather than new free parameters or invented entities.

axioms (1)

Domain assumption: Human-recorded spoken prompts accurately reflect natural user speech patterns and intent. Invoked to position the dataset as a realistic evaluation tool for spoken conditions.

Reference graph

Works this paper leans on

44 extracted references · 44 canonical work pages · 1 internal anchor

[1] “Summarise this meeting”

Introduction. Speech Large Language Models (SLLMs) have seen remarkable progress in recent years, demonstrating strong performance across both speech and text tasks [1, 2]. A key capability of these models is instruction-following (IF): rather than requiring a special tag or argument to specify a task, they can be guided through natural language prompts ...

[2] and Uro-Bench [17] are among the few benchmarks with spoken instructions, but both have notable limitations. First, their instructions are generated using text-to-speech systems, cover only English and Chinese, and are pre-concatenated with task-specific inputs, making them impossible to reuse with other datasets. Second, they focus on general instruction...

[3] Do What I Say: A Spoken Prompt Dataset for Instruction-Following

and Qwen2.5-Omni [21], using DOWIS. We find that for tasks with text output, text prompts significantly overestimate the models’ performance in contrast to spoken prompts, while for tasks with speech output, such as text-to-speech synthesis or speech-to-speech translation, spoken prompts perform on par or better. Regarding prompt style, informal text an...

[4] “Translate what was said in this presentation”

The DOWIS Prompt Dataset. We introduce DOWIS, the first multilingual speech prompt dataset comprising parallel spoken and textual instructions for nine speech and language processing tasks across 11 languages. DOWIS can be combined with any existing downstream task benchmarks and, therefore, designed to facilitate multifaceted evaluation of instruction-fol...

[5] Experiments. Models. We select Qwen2.5-Omni-7B [21] and Phi-4-multimodal-instruct [20], two state-of-the-art models (hereafter Qwen and Phi respectively), to analyze the impact of different prompt types and modalities. Both models are run with default inference parameters and batch size 1 on a single NVIDIA A100-SXM4-40GB GPU. For each task, we evaluate fi...

[6] with the microsoft/deberta-xlarge-mnli [29] model to measure semantic similarity between generated and reference answers. For speech-output tasks (TTS and S2ST), we first transcribe the generated audio using openai/whisper-large-v3 [30]. We then report WER for TTS and CometKiwi for S2ST to evaluate content accuracy. To assess speech quality, we additionally repo...

[7] Analysis. We analyse models’ performance on the DOWIS prompts along two dimensions: (1) how well state-of-the-art SLLMs perform on speech vs. text prompts (Section 4.1), and (2) how much prompt types influence their performance (Section 4.2). 4.1. Impact of Text vs. Speech Prompts. General Trends. Tab...

[8] and compute WER against the reference prompt texts. With generally high intelligibility across all prompts, we find no clear pattern between prompt WER and model performance, for example, TSUM prompts have similarly low transcription WER for both genders (12% for both), yet a performance gap persists (BERTScore 43.88 vs. 42.93). This suggests that prompt ...

[9] Conclusion. We introduced DOWIS, the first human-recorded parallel spoken-textual prompt dataset for evaluating spoken instruction-following in SLLMs. Covering nine tasks, 11 languages, and five prompt styles, DOWIS can be easily combined with any task-specific benchmark to enable more realistic and comprehensive instruction-following SLLMs evaluation....

[10] Generative AI Use Disclosure. Claude was employed exclusively to correct grammar in content authored by humans and in writing code to design paper’s plots

[11] Acknowledgements. This work has received funding from the European Union’s Horizon research and innovation programme under grant agreement No 101135798, project Meetween (My Personal AI Mediator for Virtual MEETtings BetWEEN People). This research is also supported by the project “How is AI Changing Science? Research in the Era of Learning Algorith...

[12] J. Xu et al., “Qwen3-omni technical report,” 2025.

[13] S. Arora et al., “On the landscape of spoken language models: A comprehensive survey,” 2025.

[14] T. Brown et al., “Language models are few-shot learners,” in Advances in Neural Information Processing Systems, H. Larochelle, M. Ranzato, R. Hadsell, M. Balcan, and H. Lin, Eds., vol. 33. Curran Associates, Inc., 2020, pp. 1877–1901.

[15] Y. Su et al., “PandaGPT: One model to instruction-follow them all,” in Proceedings of the 1st Workshop on Taming Large Language Models: Controllability in the era of Interactive Assistants!, D. Hazarika, X. R. Tang, and D. Jin, Eds. Prague, Czech Republic: Association for Computational Linguistics, Sep. 2023, pp. 11–23.

[16] C. Tang et al., “SALMONN: Towards generic hearing abilities for large language models,” in ICLR, 2024.

[17] K.-H. Lu et al., “Speech-IFEval: Evaluating Instruction-Following and Quantifying Catastrophic Forgetting in Speech-Aware Language Models,” in Interspeech 2025, 2025, pp. 2078–2082.

[18] Q. Yang et al., “AIR-bench: Benchmarking large audio-language models via generative comprehension,” in Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), L.-W. Ku, A. Martins, and V. Srikumar, Eds. Bangkok, Thailand: Association for Computational Linguistics, Aug. 2024, pp. 1979–1998.

[19] D. Wang et al., “Mmsu: A massive multi-task spoken language understanding and reasoning benchmark,” 2026.

[20] K. Gao et al., “Benchmarking open-ended audio dialogue understanding for large audio-language models,” in Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), W. Che, J. Nabende, E. Shutova, and M. T. Pilehvar, Eds. Vienna, Austria: Association for Computational Linguistics, Jul. 2025, pp. 4763–4784.

[21] C.-y. Huang et al., “Dynamic-SUPERB phase-2: A collaboratively expanding benchmark for measuring the capabilities of spoken language models with 180 tasks,” in ICLR, 2025.

[22] P. Pandey et al., “SIFT-50M: A large-scale multilingual dataset for speech instruction fine-tuning,” in Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), W. Che, J. Nabende, E. Shutova, and M. T. Pilehvar, Eds. Vienna, Austria: Association for Computational Linguistics, Jul. 2025, pp. 13921–13942.

[23] S. Papi et al., “MCIF: Multimodal crosslingual instruction-following benchmark from scientific talks,” in ICLR, 2026.

[24] Y. Chen et al., “Voicebench: Benchmarking llm-based voice assistants,” 2024.

[25] K. Wang et al., “Voiceassistant-eval: Benchmarking ai assistants across listening, speaking, and viewing,” 2025.

[26] C.-K. Yang et al., “SAKURA: On the Multi-hop Reasoning of Large Audio-Language Models Based on Speech and Audio Information,” in Interspeech 2025, 2025, pp. 1788–1792.

[27] D. Wang et al., “InSerter: Speech instruction following with unsupervised interleaved pre-training,” in Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), W. Che, J. Nabende, E. Shutova, and M. T. Pilehvar, Eds. Vienna, Austria: Association for Computational Linguistics, Jul. 2025, pp. 18024–18046.

[28] R. Yan et al., “URO-bench: Towards comprehensive evaluation for end-to-end spoken dialogue models,” in Findings of the Association for Computational Linguistics: EMNLP 2025, C. Christodoulopoulos, T. Chakraborty, C. Rose, and V. Peng, Eds. Suzhou, China: Association for Computational Linguistics, Nov. 2025, pp. 17211–17242.

[29] F. Alam et al., “Spokennativqa: Multilingual everyday spoken queries for llms,” 2025.

[30] F. Retkowski et al., “Summarizing speech: A comprehensive survey,” in Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, C. Christodoulopoulos, T. Chakraborty, C. Rose, and V. Peng, Eds. Suzhou, China: Association for Computational Linguistics, Nov. 2025, pp. 27275–27306.

[31] Microsoft, “Phi-4-mini technical report: Compact yet powerful multimodal language models via mixture-of-loras,” 2025.

[32] J. Xu et al., “Qwen2.5-omni technical report,” 2025.

[33] A. Conneau et al., “Fleurs: Few-shot learning evaluation of universal representations of speech,” in IEEE SLT, 2023.

[34] F. Retkowski et al., “From Text Segmentation to Smart Chaptering: A Novel Benchmark for Structuring Video Transcriptions,” in EACL, 2024.

[35] A. C. Morris et al., “From WER and RIL to MER and WIL: improved evaluation measures for connected speech recognition,” in Interspeech, 2004.

[36] R. Rei et al., “CometKiwi: IST-unbabel 2022 submission for the quality estimation shared task,” in Proc. WMT, 2022.

[37] M. Freitag et al., “Are LLMs breaking MT metrics? results of the WMT24 metrics shared task,” in Proceedings of the Ninth Conference on Machine Translation, B. Haddow, T. Kocmi, P. Koehn, and C. Monz, Eds. Miami, Florida, USA: Association for Computational Linguistics, Nov. 2024, pp. 47–81.

[38] S. Papi et al., “Hearing to Translate: The Effectiveness of Speech Modality Integration into LLMs,” 2025.

[39] T. Zhang et al., “Bertscore: Evaluating text generation with BERT,” in ICLR, 2020.

[40] P. He et al., “Deberta: Decoding-enhanced bert with disentangled attention,” in ICLR, 2021.

[41] A. Radford et al., “Robust speech recognition via large-scale weak supervision,” in Proc. ICML, 2023.

[42] T. Saeki et al., “UTMOS: UTokyo-SaruLab System for VoiceMOS Challenge 2022,” in Interspeech, 2022.

[43] F. Retkowski et al., “Beyond transcripts: A renewed perspective on audio chaptering,” 2026.

[44] G. Attanasio et al., “Twists, humps, and pebbles: Multilingual speech recognition models exhibit gender performance gaps,” in Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, Miami, Florida, USA, Nov. 2024, pp. 21318–21340.

[1] [1]

Summarise this meeting

Introduction Speech Large Language Models (SLLMs) have seen remark- able progress in recent years, demonstrating strong performance across both speech and text tasks [1, 2]. A key capability of these models is instruction-following (IF): rather than requiring a special tag or argument to specify a task, they can be guided through natural language prompts ...

work page

[2] [2]

and Uro-Bench [17] are among the few benchmarks with spoken instructions, but both have notable limitations. First, their instructions are generated using text-to-speech systems, cover only English and Chinese, and are pre-concatenated with task-specific inputs, making them impossible to reuse with other datasets. Second, they focus on general instruction...

work page

[3] [3]

Do What I Say: A Spoken Prompt Dataset for Instruction-Following

and Qwen2.5-Omni [21], using DOWIS. We find that for tasks with text output, text prompts significantly overestimate the models’ performance in contrast to spoken prompts, while for tasks with speech output, such as text-to-speech synthe- sis or speech-to-speech translation, spoken prompts perform on par or better. Regarding prompt style, informal text an...

work page internal anchor Pith review Pith/arXiv arXiv 2026

[4] [4]

Translate what was said in this pre- sentation

The DOWIS Prompt Dataset We introduce DOWIS, the first multilingual speech prompt dataset comprising parallel spoken and textual instructions for nine speech and language processing tasks across 11 languages. DOWIS can be combined with any existing downstream task benchmarks and, therefore, designed to facilitate multifaceted evaluation of instruction-fol...

work page

[5] [5]

Both mod- els are run with default inference parameters and batch size 1 on a single NVIDIA A100-SXM4-40GB GPU

Experiments Models.We selectQwen2.5-Omni-7B 2 [21] and Phi-4-multimodal-instruct3 [20], two state-of-the- art models (hereafterQwenandPhirespectively), to analyze the impact of different prompt types and modalities. Both mod- els are run with default inference parameters and batch size 1 on a single NVIDIA A100-SXM4-40GB GPU. For each task, we evaluate fi...

work page

[6] [6]

For speech-output tasks (TTS and S2ST), we first tran- scribe the generated audio usingwhisper-large-v3 5 [30]

with thedeberta-xlarge-mnli 4 [29] model to mea- sure semantic similarity between generated and reference an- swers. For speech-output tasks (TTS and S2ST), we first tran- scribe the generated audio usingwhisper-large-v3 5 [30]. We then report WER for TTS and CometKiwi for S2ST to eval- uate content accuracy. To assess speech quality, we additionally repo...

work page

[7] [7]

text prompts (Section 4.1), and (2) how much prompt types influence their performance (Section 4.2)

Analysis We analyse models’ performance on the DOWIS prompts along two dimensions: (1) how well state-of-the-art SLLMs perform 4 microsoft/deberta-xlarge-mnli 5 openai/whisper-large-v3 on speech vs. text prompts (Section 4.1), and (2) how much prompt types influence their performance (Section 4.2). 4.1. Impact of Text vs. Speech Prompts General Trends.Tab...

work page

[8] [8]

and compute WER against the reference prompt texts. With generally high intelligibility across all prompts, we find no clear pattern between prompt WER and model performance, for example, TSUM prompts have similarly low transcription WER for both genders (12% for both), yet a performance gap persists (BERTScore 43.88 vs. 42.93). This suggests that prompt ...

work page

[9] [9]

Conclusion We introduced DOWIS, the first human-recorded paral- lel spoken-textual prompt dataset for evaluating spoken instruction-following in SLLMs. Covering nine tasks, 11 lan- guages, and five prompt styles, DOWIS can be easily combined with any task-specific benchmark to enable more realistic and comprehensive instruction-following SLLMs evaluation....

Generative AI Use Disclosure

Claude was employed exclusively to correct grammar in human-authored content and to write code for the paper’s plots.

Acknowledgements

This work has received funding from the European Union’s Horizon research and innovation programme under grant agreement No 101135798, project Meetween (My Personal AI Mediator for Virtual MEETtings BetWEEN People). This research is also supported by the project “How is AI Changing Science? Research in the Era of Learning Algorith...

[12] J. Xu et al., “Qwen3-omni technical report,” 2025.

[13] S. Arora et al., “On the landscape of spoken language models: A comprehensive survey,” 2025.

[14] T. Brown et al., “Language models are few-shot learners,” in Advances in Neural Information Processing Systems, H. Larochelle, M. Ranzato, R. Hadsell, M. Balcan, and H. Lin, Eds., vol. 33. Curran Associates, Inc., 2020, pp. 1877–1901.

[15] Y. Su et al., “PandaGPT: One model to instruction-follow them all,” in Proceedings of the 1st Workshop on Taming Large Language Models: Controllability in the era of Interactive Assistants!, D. Hazarika, X. R. Tang, and D. Jin, Eds. Prague, Czech Republic: Association for Computational Linguistics, Sep. 2023, pp. 11–23.

[16] C. Tang et al., “SALMONN: Towards generic hearing abilities for large language models,” in ICLR, 2024.

[17] K.-H. Lu et al., “Speech-IFEval: Evaluating Instruction-Following and Quantifying Catastrophic Forgetting in Speech-Aware Language Models,” in Interspeech 2025, 2025, pp. 2078–2082.

[18] Q. Yang et al., “AIR-bench: Benchmarking large audio-language models via generative comprehension,” in Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), L.-W. Ku, A. Martins, and V. Srikumar, Eds. Bangkok, Thailand: Association for Computational Linguistics, Aug. 2024, pp. 1979–1998.

[19] D. Wang et al., “Mmsu: A massive multi-task spoken language understanding and reasoning benchmark,” 2026.

[20] K. Gao et al., “Benchmarking open-ended audio dialogue understanding for large audio-language models,” in Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), W. Che, J. Nabende, E. Shutova, and M. T. Pilehvar, Eds. Vienna, Austria: Association for Computational Linguistics, Jul. 2025, pp. 4763–4784.

[21] C.-y. Huang et al., “Dynamic-SUPERB phase-2: A collaboratively expanding benchmark for measuring the capabilities of spoken language models with 180 tasks,” in ICLR, 2025.

[22] P. Pandey et al., “SIFT-50M: A large-scale multilingual dataset for speech instruction fine-tuning,” in Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), W. Che, J. Nabende, E. Shutova, and M. T. Pilehvar, Eds. Vienna, Austria: Association for Computational Linguistics, Jul. 2025, pp. 13 921–13 942.

[23] S. Papi et al., “MCIF: Multimodal crosslingual instruction-following benchmark from scientific talks,” in ICLR, 2026.

[24] Y. Chen et al., “Voicebench: Benchmarking llm-based voice assistants,” 2024.

[25] K. Wang et al., “Voiceassistant-eval: Benchmarking ai assistants across listening, speaking, and viewing,” 2025.

[26] C.-K. Yang et al., “SAKURA: On the Multi-hop Reasoning of Large Audio-Language Models Based on Speech and Audio Information,” in Interspeech 2025, 2025, pp. 1788–1792.

[27] D. Wang et al., “InSerter: Speech instruction following with unsupervised interleaved pre-training,” in Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), W. Che, J. Nabende, E. Shutova, and M. T. Pilehvar, Eds. Vienna, Austria: Association for Computational Linguistics, Jul. 2025, pp. 18 024–18 046.

[28] R. Yan et al., “URO-bench: Towards comprehensive evaluation for end-to-end spoken dialogue models,” in Findings of the Association for Computational Linguistics: EMNLP 2025, C. Christodoulopoulos, T. Chakraborty, C. Rose, and V. Peng, Eds. Suzhou, China: Association for Computational Linguistics, Nov. 2025, pp. 17 211–17 242.

[29] F. Alam et al., “Spokennativqa: Multilingual everyday spoken queries for llms,” 2025.

[30] F. Retkowski et al., “Summarizing speech: A comprehensive survey,” in Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, C. Christodoulopoulos, T. Chakraborty, C. Rose, and V. Peng, Eds. Suzhou, China: Association for Computational Linguistics, Nov. 2025, pp. 27 275–27 306.

[31] Microsoft, “Phi-4-mini technical report: Compact yet powerful multimodal language models via mixture-of-loras,” 2025.

[32] J. Xu et al., “Qwen2.5-omni technical report,” 2025.

[33] A. Conneau et al., “Fleurs: Few-shot learning evaluation of universal representations of speech,” in IEEE SLT, 2023.

[34] F. Retkowski et al., “From Text Segmentation to Smart Chaptering: A Novel Benchmark for Structuring Video Transcriptions,” in EACL, 2024.

[35] A. C. Morris et al., “From WER and RIL to MER and WIL: improved evaluation measures for connected speech recognition,” in Interspeech, 2004.

[36] R. Rei et al., “CometKiwi: IST-unbabel 2022 submission for the quality estimation shared task,” in Proc. WMT, 2022.

[37] M. Freitag et al., “Are LLMs breaking MT metrics? results of the WMT24 metrics shared task,” in Proceedings of the Ninth Conference on Machine Translation, B. Haddow, T. Kocmi, P. Koehn, and C. Monz, Eds. Miami, Florida, USA: Association for Computational Linguistics, Nov. 2024, pp. 47–81.

[38] S. Papi et al., “Hearing to Translate: The Effectiveness of Speech Modality Integration into LLMs,” 2025.

[39] T. Zhang et al., “Bertscore: Evaluating text generation with BERT,” in ICLR, 2020.

[40] P. He et al., “Deberta: Decoding-enhanced bert with disentangled attention,” in ICLR, 2021.

[41] A. Radford et al., “Robust speech recognition via large-scale weak supervision,” in Proc. ICML, 2023.

[42] T. Saeki et al., “UTMOS: UTokyo-SaruLab System for VoiceMOS Challenge 2022,” in Interspeech, 2022.

[43] F. Retkowski et al., “Beyond transcripts: A renewed perspective on audio chaptering,” 2026.

[44] G. Attanasio et al., “Twists, humps, and pebbles: Multilingual speech recognition models exhibit gender performance gaps,” in Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, Miami, Florida, USA, Nov. 2024, pp. 21 318–21 340.