arxiv: 2603.09881 · v2 · submitted 2026-03-10 · 💻 cs.CL

Do What I Say: A Spoken Prompt Dataset for Instruction-Following

Pith reviewed 2026-05-15 13:13 UTC · model grok-4.3

classification 💻 cs.CL
keywords spoken prompts · speech large language models · instruction following · multilingual dataset · prompt modality · SLLM evaluation · benchmarking

The pith

Text prompts outperform spoken prompts when testing speech language models, except on tasks that require speech output.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces DoWhatISay, a dataset of human-recorded spoken prompts paired with text versions for nine tasks across eleven languages. It benchmarks current speech large language models and shows text prompts work better overall, with the gap widening for low-resource languages and cross-lingual use. Spoken prompts only close the performance difference when the model must generate speech as its answer. This setup matters because real users speak to these models rather than type, so current text-based tests may overestimate how well the models handle everyday spoken instructions.

Core claim

We introduce DoWhatISay (DOWIS), a multilingual dataset of human-recorded spoken and written prompts designed to pair with any existing benchmark for realistic evaluation of SLLMs under spoken instruction conditions. Spanning 9 tasks and 11 languages, it provides 10 prompt variants per task-language pair across five styles. Benchmarking state-of-the-art SLLMs shows that text prompts consistently outperform spoken prompts, particularly for low-resource and cross-lingual settings. Only for tasks with speech output do spoken prompts close the gap.
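
The counts in the core claim imply a simple prompt grid. The sketch below works through that arithmetic under two assumptions the text does not confirm: every task-language pair is covered, and the 10 variants split evenly over the five styles. The task and language identifiers are placeholders, and the style labels are an assumption expanded from the paper's prompt-type table headers (Basic, Formal, Inform., Detail., Short).

```python
from itertools import product

# Counts stated in the paper's abstract.
N_TASKS, N_LANGUAGES, VARIANTS_PER_PAIR = 9, 11, 10

# Assumed style labels, expanded from abbreviated table headers.
STYLES = ["basic", "formal", "informal", "detailed", "short"]

# Placeholder identifiers: the real task and language names are not used here.
tasks = [f"task_{i}" for i in range(N_TASKS)]
languages = [f"lang_{i}" for i in range(N_LANGUAGES)]

# Assumption: the 10 variants split evenly across the five styles.
variants_per_style = VARIANTS_PER_PAIR // len(STYLES)

prompt_grid = [
    {"task": t, "language": l, "style": s, "variant": v}
    for t, l, s in product(tasks, languages, STYLES)
    for v in range(variants_per_style)
]

n_pairs = len(tasks) * len(languages)
print(n_pairs)           # number of task-language pairs
print(len(prompt_grid))  # number of written prompts under these assumptions
```

Under these assumptions the grid holds 99 task-language pairs and 990 written prompts, each of which would have at least one spoken counterpart.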

What carries the argument

The DOWIS dataset, which supplies paired human-recorded spoken and text prompts for the same instructions to enable direct comparison of prompt modality effects on SLLM performance.

If this is right

  • SLLM benchmarks should include spoken prompt conditions to avoid overestimating model capabilities in realistic use.
  • Low-resource language performance drops more sharply with spoken input, indicating a priority area for model improvement.
  • Speech-output tasks benefit from modality-matched prompting, suggesting targeted training on aligned input-output speech.
  • Cross-lingual settings amplify the spoken-prompt penalty, pointing to weaker transfer when instructions arrive as audio.
  • Prompt style variations interact with modality, so evaluation sets need multiple styles to capture full robustness.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Model developers could add explicit speech-input training objectives to narrow the gap observed in most tasks.
  • The dataset enables future work on prompt engineering techniques that compensate for spoken input weaknesses.
  • Real deployments might route spoken instructions through a text transcription step for non-speech tasks to retain higher accuracy.
  • Extending the dataset with more spontaneous, noisy recordings would test whether the current gap widens or narrows under messier conditions.
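
The routing idea in the third bullet can be sketched as follows. This is an illustration only: `route_instruction` and the three callables are hypothetical names, not part of any model's API, and treating TTS and S2ST as the speech-output tasks is an assumption based on the examples the paper names.

```python
from typing import Callable

# Assumption: TTS and S2ST are the speech-output tasks.
SPEECH_OUTPUT_TASKS = {"TTS", "S2ST"}


def route_instruction(
    task: str,
    spoken_prompt: bytes,
    transcribe: Callable[[bytes], str],
    run_with_text_prompt: Callable[[str], str],
    run_with_spoken_prompt: Callable[[bytes], str],
) -> str:
    """Keep the prompt spoken for speech-output tasks, else transcribe it first."""
    if task in SPEECH_OUTPUT_TASKS:
        return run_with_spoken_prompt(spoken_prompt)
    return run_with_text_prompt(transcribe(spoken_prompt))


# Stand-in callables so the sketch runs without a real model or ASR system.
def fake_transcribe(audio: bytes) -> str:
    return "transcribe the recording"


def fake_text_model(prompt: str) -> str:
    return f"text-prompted: {prompt}"


def fake_speech_model(audio: bytes) -> str:
    return "speech-prompted"


print(route_instruction("ASR", b"\x00", fake_transcribe, fake_text_model, fake_speech_model))
print(route_instruction("TTS", b"\x00", fake_transcribe, fake_text_model, fake_speech_model))
```

Whether the transcription step recovers the full text-prompt accuracy depends on the transcription quality, which this sketch does not model.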

Load-bearing premise

The recorded spoken prompts and selected task-language-style combinations reflect the speech patterns and variations that real users would produce when interacting with these models.

What would settle it

A study in which actual users give spoken instructions to SLLMs on non-speech tasks and achieve performance equal to or better than text prompts would show the outperformance claim does not hold.

Figures

Figures reproduced from arXiv: 2603.09881 by Alexander Waibel, Fabian Retkowski, Jan Niehues, Luisa Bentivogli, Maike Züfle, Marek Kasztelnik, Sara Papi, Szymon Mazurek.

Figure 1. Performance comparison for Qwen: text prompts vs. speech prompts with respect to different target languages. Positive values (purple) indicate that text prompts perform better; negative values indicate that speech prompts perform better.
Figure 2. Performance comparison for Qwen2.5-Omni: text prompts vs. speech prompts with respect to different prompt types. Positive values (purple) indicate that text prompts perform better; negative values (yellow) indicate that speech prompts perform better.
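
Both figures use a "positive means text is better" convention, which needs care when a metric such as WER is lower-is-better. The sketch below shows one way to compute such a signed gap. It is an assumption about how the sign could be handled, not the authors' plotting code, and the numbers are illustrative rather than results from the paper.

```python
# Metric direction: True when a higher score is better.
HIGHER_IS_BETTER = {"WER": False, "BERTScore": True}


def text_advantage(metric: str, text_score: float, speech_score: float) -> float:
    """Return a gap that is positive when the text prompt performs better."""
    diff = text_score - speech_score
    # For lower-is-better metrics, flip the sign so the convention still holds.
    return diff if HIGHER_IS_BETTER[metric] else -diff


# Illustrative values only.
print(text_advantage("WER", 10.0, 15.0))        # text has the lower error rate
print(text_advantage("BERTScore", 40.0, 35.0))  # text has the higher similarity
```

Both calls return a positive gap, since the text prompt is the better condition in each illustrative case.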
Original abstract

Speech Large Language Models (SLLMs) have rapidly expanded, supporting a wide range of tasks. These models are typically evaluated using text prompts, which may not reflect real-world scenarios where users interact with speech. To address this gap, we introduce DoWhatISay (DOWIS), a multilingual dataset of human-recorded spoken and written prompts designed to pair with any existing benchmark for realistic evaluation of SLLMs under spoken instruction conditions. Spanning 9 tasks and 11 languages, it provides 10 prompt variants per task-language pair, across five styles. Using DOWIS, we benchmark state-of-the-art SLLMs, analyzing the interplay between prompt modality, style, language, and task type. Results show that text prompts consistently outperform spoken prompts, particularly for low-resource and cross-lingual settings. Only for tasks with speech output, spoken prompts do close the gap, highlighting the need for speech-based prompting in SLLM evaluation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces DoWhatISay (DOWIS), a multilingual dataset of human-recorded spoken and written prompts designed to pair with existing benchmarks for evaluating Speech Large Language Models (SLLMs). Spanning 9 tasks and 11 languages, with 10 prompt variants per task-language pair across five styles, the work benchmarks state-of-the-art SLLMs and reports that text prompts consistently outperform spoken prompts (especially in low-resource and cross-lingual settings), with spoken prompts closing the performance gap only for tasks involving speech output.

Significance. If the dataset proves representative, this provides a useful new resource for realistic SLLM evaluation and demonstrates concrete modality effects that could guide future model development and prompting strategies in multilingual and speech-centric scenarios. The empirical nature of the comparisons (new runs on existing models) adds direct evidence without circularity.

major comments (3)
  1. [Dataset construction] Dataset construction section: the paper provides no details on speaker pool size, demographics, accent diversity, recording environment (e.g., noise levels, equipment), quality assurance protocols, or any validation against in-the-wild spoken interactions. This directly undermines the central claim that observed text-vs-spoken gaps reflect intrinsic SLLM properties rather than artifacts of the controlled recording protocol.
  2. [Results and analysis] Results and analysis section: exact evaluation metrics (e.g., accuracy, F1, or task-specific scores) are not specified, nor are statistical significance tests or confidence intervals reported for the modality differences. Without these, the strength of the finding that text outperforms spoken (particularly low-resource/cross-lingual) cannot be fully assessed.
  3. [Task and language selection] Task and language selection: it is unclear how the specific 9 tasks and 11 languages were chosen or whether they include sufficient low-resource examples to support the generalization that text prompts are superior in those regimes; this choice is load-bearing for the cross-lingual claims.
minor comments (2)
  1. [Abstract] The abstract states 'five styles' without naming them; this should be defined in the abstract or introduction for immediate clarity.
  2. [Figures and tables] Figure captions and table headers could more explicitly indicate which columns correspond to spoken vs. text prompt conditions to aid quick reading.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment point by point below and will revise the paper to provide the requested clarifications and details.

Point-by-point responses
  1. Referee: [Dataset construction] Dataset construction section: the paper provides no details on speaker pool size, demographics, accent diversity, recording environment (e.g., noise levels, equipment), quality assurance protocols, or any validation against in-the-wild spoken interactions. This directly undermines the central claim that observed text-vs-spoken gaps reflect intrinsic SLLM properties rather than artifacts of the controlled recording protocol.

    Authors: We agree that additional details on dataset construction are needed. In the revised manuscript, we will expand the relevant section to include speaker pool size and demographics, accent diversity, recording environment specifications (including equipment and noise levels), quality assurance protocols, and a discussion of limitations relative to in-the-wild interactions. This will better support interpretation of the modality gaps while noting the controlled nature of the recordings for consistent comparison.
    revision: yes

  2. Referee: [Results and analysis] Results and analysis section: exact evaluation metrics (e.g., accuracy, F1, or task-specific scores) are not specified, nor are statistical significance tests or confidence intervals reported for the modality differences. Without these, the strength of the finding that text outperforms spoken (particularly low-resource/cross-lingual) cannot be fully assessed.

    Authors: We will revise the Results and Analysis section to explicitly define the evaluation metrics for each task and include statistical significance tests with confidence intervals for the reported modality differences. This will allow for a more rigorous assessment of the findings.
    revision: yes

  3. Referee: [Task and language selection] Task and language selection: it is unclear how the specific 9 tasks and 11 languages were chosen or whether they include sufficient low-resource examples to support the generalization that text prompts are superior in those regimes; this choice is load-bearing for the cross-lingual claims.

    Authors: We will add a subsection detailing the selection criteria for the 9 tasks and 11 languages, including justification for coverage of low-resource languages to support the cross-lingual claims. This will make the rationale transparent.
    revision: yes
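
The confidence intervals requested in major comment 2 could be obtained in several ways. One common option, shown here as a minimal sketch and not as the authors' planned method, is a paired percentile bootstrap over per-example score gaps. The function name, the resample count, and the example scores are all illustrative assumptions.

```python
import random


def paired_bootstrap_ci(text_scores, speech_scores, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean per-example gap (text minus speech)."""
    if len(text_scores) != len(speech_scores):
        raise ValueError("scores must be paired per example")
    gaps = [t - s for t, s in zip(text_scores, speech_scores)]
    rng = random.Random(seed)
    n = len(gaps)
    # Resample the paired gaps with replacement and record each resample's mean.
    means = sorted(
        sum(rng.choice(gaps) for _ in range(n)) / n
        for _ in range(n_resamples)
    )
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return sum(gaps) / n, lower, upper


# Made-up higher-is-better scores for the same five examples under each modality.
text = [0.80, 0.70, 0.90, 0.60, 0.75]
speech = [0.70, 0.65, 0.80, 0.50, 0.70]
mean_gap, lower, upper = paired_bootstrap_ci(text, speech)
print(mean_gap, lower, upper)
```

An interval that excludes zero would suggest the modality gap is unlikely to come from example sampling alone. Pairing matters here because both modalities are scored on the same benchmark items.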

Circularity Check

0 steps flagged

No significant circularity in empirical dataset evaluation

The paper introduces the DOWIS dataset of spoken/written prompts and reports direct benchmarking results on existing SLLMs. All claims (text prompts outperforming spoken ones except on speech-output tasks) follow from new empirical runs without equations, fitted parameters renamed as predictions, or self-citation chains that reduce the central findings to prior inputs by construction. The evaluation is self-contained against external model benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim depends on the dataset serving as a valid proxy for real spoken instructions, resting on standard assumptions about human recording fidelity and task representativeness rather than new free parameters or invented entities.

axioms (1)
  • domain assumption: Human-recorded spoken prompts accurately reflect natural user speech patterns and intent.
    Invoked to position the dataset as a realistic evaluation tool for spoken conditions.

pith-pipeline@v0.9.0 · 5491 in / 1123 out tokens · 59182 ms · 2026-05-15T13:13:11.229888+00:00 · methodology

Reference graph

Works this paper leans on

44 extracted references · 44 canonical work pages · 1 internal anchor

  1. [1]

    Summarise this meeting

    Introduction. Speech Large Language Models (SLLMs) have seen remarkable progress in recent years, demonstrating strong performance across both speech and text tasks [1, 2]. A key capability of these models is instruction-following (IF): rather than requiring a special tag or argument to specify a task, they can be guided through natural language prompts ...

  2. [2]

    and Uro-Bench [17] are among the few benchmarks with spoken instructions, but both have notable limitations. First, their instructions are generated using text-to-speech systems, cover only English and Chinese, and are pre-concatenated with task-specific inputs, making them impossible to reuse with other datasets. Second, they focus on general instruction...

  3. [3]

    Do What I Say: A Spoken Prompt Dataset for Instruction-Following

    and Qwen2.5-Omni [21], using DOWIS. We find that for tasks with text output, text prompts significantly overestimate the models’ performance in contrast to spoken prompts, while for tasks with speech output, such as text-to-speech synthesis or speech-to-speech translation, spoken prompts perform on par or better. Regarding prompt style, informal text an...

  4. [4]

    Translate what was said in this presentation

    The DOWIS Prompt Dataset. We introduce DOWIS, the first multilingual speech prompt dataset comprising parallel spoken and textual instructions for nine speech and language processing tasks across 11 languages. DOWIS can be combined with any existing downstream task benchmarks and is therefore designed to facilitate multifaceted evaluation of instruction-fol...

  5. [5]

    Both models are run with default inference parameters and batch size 1 on a single NVIDIA A100-SXM4-40GB GPU

    Experiments. Models. We select Qwen2.5-Omni-7B [21] and Phi-4-multimodal-instruct [20], two state-of-the-art models (hereafter Qwen and Phi respectively), to analyze the impact of different prompt types and modalities. Both models are run with default inference parameters and batch size 1 on a single NVIDIA A100-SXM4-40GB GPU. For each task, we evaluate fi...

  6. [6]

    For speech-output tasks (TTS and S2ST), we first transcribe the generated audio using whisper-large-v3 [30]

    with the deberta-xlarge-mnli [29] model to measure semantic similarity between generated and reference answers. For speech-output tasks (TTS and S2ST), we first transcribe the generated audio using whisper-large-v3 [30]. We then report WER for TTS and CometKiwi for S2ST to evaluate content accuracy. To assess speech quality, we additionally repo...

  7. [7]

    text prompts (Section 4.1), and (2) how much prompt types influence their performance (Section 4.2)

    Analysis. We analyse models’ performance on the DOWIS prompts along two dimensions: (1) how well state-of-the-art SLLMs perform on speech vs. text prompts (Section 4.1), and (2) how much prompt types influence their performance (Section 4.2). 4.1. Impact of Text vs. Speech Prompts. General Trends. Tab...

  8. [8]

    and compute WER against the reference prompt texts. With generally high intelligibility across all prompts, we find no clear pattern between prompt WER and model performance. For example, TSUM prompts have similarly low transcription WER for both genders (12% for both), yet a performance gap persists (BERTScore 43.88 vs. 42.93). This suggests that prompt ...

  9. [9]

    Conclusion. We introduced DOWIS, the first human-recorded parallel spoken-textual prompt dataset for evaluating spoken instruction-following in SLLMs. Covering nine tasks, 11 languages, and five prompt styles, DOWIS can be easily combined with any task-specific benchmark to enable more realistic and comprehensive instruction-following SLLMs evaluation....

  10. [10]

    Generative AI Use Disclosure Claude was employed exclusively to correct grammar in content authored by humans and in writing code to design paper’s plots

  11. [11]

    How is AI Changing Science? Research in the Era of Learning Algorithms

    Acknowledgements. This work has received funding from the European Union’s Horizon research and innovation programme under grant agreement No 101135798, project Meetween (My Personal AI Mediator for Virtual MEETtings BetWEEN People). This research is also supported by the project “How is AI Changing Science? Research in the Era of Learning Algorith...

  12. [12]

    Qwen3-omni technical report,

    J. Xu et al., “Qwen3-omni technical report,” 2025

  13. [13]

    On the landscape of spoken language models: A comprehensive survey,

    S. Arora et al., “On the landscape of spoken language models: A comprehensive survey,” 2025

  14. [14]

    Language models are few-shot learners,

    T. Brown et al., “Language models are few-shot learners,” H. Larochelle, M. Ranzato, R. Hadsell, M. Balcan, and H. Lin, Eds., vol. 33. Curran Associates, Inc., 2020, pp. 1877–1901

  15. [15]

    PandaGPT: One model to instruction-follow them all,

    Y. Su et al., “PandaGPT: One model to instruction-follow them all,” in Proceedings of the 1st Workshop on Taming Large Language Models: Controllability in the era of Interactive Assistants!, D. Hazarika, X. R. Tang, and D. Jin, Eds. Prague, Czech Republic: Association for Computational Linguistics, Sep. 2023, pp. 11–23

  16. [16]

    SALMONN: Towards generic hearing abilities for large language models,

    C. Tang et al., “SALMONN: Towards generic hearing abilities for large language models,” in ICLR, 2024

  17. [17]

    Speech-IFEval: Evaluating Instruction-Following and Quantifying Catastrophic Forgetting in Speech-Aware Language Models,

    K.-H. Lu et al., “Speech-IFEval: Evaluating Instruction-Following and Quantifying Catastrophic Forgetting in Speech-Aware Language Models,” in Interspeech 2025, 2025, pp. 2078–2082

  18. [18]

    AIR-bench: Benchmarking large audio-language models via generative comprehension,

    Q. Yang et al., “AIR-bench: Benchmarking large audio-language models via generative comprehension,” in Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), L.-W. Ku, A. Martins, and V. Srikumar, Eds. Bangkok, Thailand: Association for Computational Linguistics, Aug. 2024, pp. 1979–1998

  19. [19]

    Mmsu: A massive multi-task spoken language understanding and reasoning benchmark,

    D. Wang et al., “Mmsu: A massive multi-task spoken language understanding and reasoning benchmark,” 2026

  20. [20]

    Benchmarking open-ended audio dialogue understanding for large audio-language models,

    K. Gao et al., “Benchmarking open-ended audio dialogue understanding for large audio-language models,” in Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), W. Che, J. Nabende, E. Shutova, and M. T. Pilehvar, Eds. Vienna, Austria: Association for Computational Linguistics, Jul. 2025, pp. 4763–4784

  21. [21]

    Dynamic-SUPERB phase-2: A collaboratively expanding benchmark for measuring the capabilities of spoken language models with 180 tasks,

    C. yu Huang et al., “Dynamic-SUPERB phase-2: A collaboratively expanding benchmark for measuring the capabilities of spoken language models with 180 tasks,” in ICLR, 2025

  22. [22]

    SIFT-50M: A large-scale multilingual dataset for speech instruction fine-tuning,

    P. Pandey et al., “SIFT-50M: A large-scale multilingual dataset for speech instruction fine-tuning,” in Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), W. Che, J. Nabende, E. Shutova, and M. T. Pilehvar, Eds. Vienna, Austria: Association for Computational Linguistics, Jul. 2025, pp. 13 921–13 942

  23. [23]

    MCIF: Multimodal crosslingual instruction-following benchmark from scientific talks,

    S. Papi et al., “MCIF: Multimodal crosslingual instruction-following benchmark from scientific talks,” in ICLR, 2026

  24. [24]

    Voicebench: Benchmarking llm-based voice assistants,

    Y. Chen et al., “Voicebench: Benchmarking llm-based voice assistants,” 2024

  25. [25]

    Voiceassistant-eval: Benchmarking ai assistants across listening, speaking, and viewing,

    K. Wang et al., “Voiceassistant-eval: Benchmarking ai assistants across listening, speaking, and viewing,” 2025

  26. [26]

    SAKURA: On the Multi-hop Reasoning of Large Audio-Language Models Based on Speech and Audio Information,

    C.-K. Yang et al., “SAKURA: On the Multi-hop Reasoning of Large Audio-Language Models Based on Speech and Audio Information,” in Interspeech 2025, 2025, pp. 1788–1792

  27. [27]

    InSerter: Speech instruction following with unsupervised interleaved pre-training,

    D. Wang et al., “InSerter: Speech instruction following with unsupervised interleaved pre-training,” in Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), W. Che, J. Nabende, E. Shutova, and M. T. Pilehvar, Eds. Vienna, Austria: Association for Computational Linguistics, Jul. 2025, pp. 18 024–18 046

  28. [28]

    URO-bench: Towards comprehensive evaluation for end-to-end spoken dialogue models,

    R. Yan et al., “URO-bench: Towards comprehensive evaluation for end-to-end spoken dialogue models,” in Findings of the Association for Computational Linguistics: EMNLP 2025, C. Christodoulopoulos, T. Chakraborty, C. Rose, and V. Peng, Eds. Suzhou, China: Association for Computational Linguistics, Nov. 2025, pp. 17 211–17 242

  29. [29]

    Spokennativqa: Multilingual everyday spoken queries for llms,

    F. Alam et al., “Spokennativqa: Multilingual everyday spoken queries for llms,” 2025

  30. [30]

    Summarizing speech: A comprehensive survey,

    F. Retkowski et al., “Summarizing speech: A comprehensive survey,” in Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, C. Christodoulopoulos, T. Chakraborty, C. Rose, and V. Peng, Eds. Suzhou, China: Association for Computational Linguistics, Nov. 2025, pp. 27 275–27 306

  31. [31]

    Phi-4-mini technical report: Compact yet powerful multimodal language models via mixture-of-loras,

    Microsoft, “Phi-4-mini technical report: Compact yet powerful multimodal language models via mixture-of-loras,” 2025

  32. [32]

    Qwen2.5-omni technical report,

    J. Xu et al., “Qwen2.5-omni technical report,” 2025

  33. [33]

    Fleurs: Few-shot learning evaluation of universal representations of speech,

    A. Conneau et al., “Fleurs: Few-shot learning evaluation of universal representations of speech,” in IEEE SLT, 2023

  34. [34]

    From Text Segmentation to Smart Chaptering: A Novel Benchmark for Structuring Video Transcriptions,

    F. Retkowski et al., “From Text Segmentation to Smart Chaptering: A Novel Benchmark for Structuring Video Transcriptions,” in EACL, 2024

  35. [35]

    From WER and RIL to MER and WIL: improved evaluation measures for connected speech recognition,

    A. C. Morris et al., “From WER and RIL to MER and WIL: improved evaluation measures for connected speech recognition,” in Interspeech, 2004

  36. [36]

    CometKiwi: IST-unbabel 2022 submission for the quality estimation shared task,

    R. Rei et al., “CometKiwi: IST-unbabel 2022 submission for the quality estimation shared task,” in Proc. WMT, 2022

  37. [37]

    Are LLMs breaking MT metrics? results of the WMT24 metrics shared task,

    M. Freitag et al., “Are LLMs breaking MT metrics? results of the WMT24 metrics shared task,” in Proceedings of the Ninth Conference on Machine Translation, B. Haddow, T. Kocmi, P. Koehn, and C. Monz, Eds. Miami, Florida, USA: Association for Computational Linguistics, Nov. 2024, pp. 47–81

  38. [38]

    Hearing to Translate: The Effectiveness of Speech Modality Integration into LLMs,

    S. Papi et al., “Hearing to Translate: The Effectiveness of Speech Modality Integration into LLMs,” 2025

  39. [39]

    Bertscore: Evaluating text generation with BERT,

    T. Zhang et al., “Bertscore: Evaluating text generation with BERT,” in ICLR, 2020

  40. [40]

    Deberta: Decoding-enhanced bert with disentangled attention,

    P. He et al., “Deberta: Decoding-enhanced bert with disentangled attention,” in ICLR, 2021

  41. [41]

    Robust speech recognition via large-scale weak supervision,

    A. Radford et al., “Robust speech recognition via large-scale weak supervision,” in Proc. ICML, 2023

  42. [42]

    UTMOS: UTokyo-SaruLab System for VoiceMOS Challenge 2022,

    T. Saeki et al., “UTMOS: UTokyo-SaruLab System for VoiceMOS Challenge 2022,” in Interspeech, 2022

  43. [43]

    Beyond transcripts: A renewed perspective on audio chaptering,

    F. Retkowski et al., “Beyond transcripts: A renewed perspective on audio chaptering,” 2026

  44. [44]

    Twists, humps, and pebbles: Multilingual speech recognition models exhibit gender performance gaps,

    G. Attanasio et al., “Twists, humps, and pebbles: Multilingual speech recognition models exhibit gender performance gaps,” in Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, Miami, Florida, USA, Nov. 2024, pp. 21 318–21 340