Do What I Say: A Spoken Prompt Dataset for Instruction-Following
Pith reviewed 2026-05-15 13:13 UTC · model grok-4.3
The pith
Text prompts outperform spoken prompts when testing speech language models, except on tasks that require speech output.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We introduce DoWhatISay (DOWIS), a multilingual dataset of human-recorded spoken and written prompts designed to pair with any existing benchmark for realistic evaluation of SLLMs under spoken instruction conditions. Spanning 9 tasks and 11 languages, it provides 10 prompt variants per task-language pair across five styles. Benchmarking state-of-the-art SLLMs shows that text prompts consistently outperform spoken prompts, particularly for low-resource and cross-lingual settings. Only for tasks with speech output do spoken prompts close the gap.
What carries the argument
The DOWIS dataset, which supplies paired human-recorded spoken and text prompts for the same instructions to enable direct comparison of prompt modality effects on SLLM performance.
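To make the paired comparison concrete, the following is a minimal Python sketch of how per-condition scores could be reduced to a text-minus-speech gap for each task-language pair. The record layout, the field names, and the toy scores are assumptions made for illustration; they are not the paper's data or code, and the sketch assumes a metric where higher is better (for an error rate such as WER the sign would flip).

```python
from collections import defaultdict
from statistics import mean

# Hypothetical score records: one row per (task, language, prompt variant, modality).
# Field names and numbers are illustrative only, not taken from the paper.
records = [
    {"task": "TSUM", "lang": "en", "variant": 0, "modality": "text", "score": 0.90},
    {"task": "TSUM", "lang": "en", "variant": 0, "modality": "speech", "score": 0.85},
    {"task": "TSUM", "lang": "en", "variant": 1, "modality": "text", "score": 0.88},
    {"task": "TSUM", "lang": "en", "variant": 1, "modality": "speech", "score": 0.80},
]


def modality_gap(rows):
    """Mean (text - speech) score per (task, lang).

    Only prompt variants scored in both modalities contribute, so each
    difference compares the same instruction delivered two ways.
    """
    paired = defaultdict(dict)
    for r in rows:
        paired[(r["task"], r["lang"], r["variant"])][r["modality"]] = r["score"]

    diffs = defaultdict(list)
    for (task, lang, _variant), by_modality in paired.items():
        if "text" in by_modality and "speech" in by_modality:
            diffs[(task, lang)].append(by_modality["text"] - by_modality["speech"])

    return {key: mean(values) for key, values in diffs.items()}


gaps = modality_gap(records)
```

A positive entry in `gaps` means the text prompt scored higher than its spoken counterpart for that task-language pair.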
If this is right
- SLLM benchmarks should include spoken prompt conditions to avoid overestimating model capabilities in realistic use.
- Low-resource language performance drops more sharply with spoken input, indicating a priority area for model improvement.
- Speech-output tasks benefit from modality-matched prompting, suggesting targeted training on aligned input-output speech.
- Cross-lingual settings amplify the spoken-prompt penalty, pointing to weaker transfer when instructions arrive as audio.
- Prompt style variations interact with modality, so evaluation sets need multiple styles to capture full robustness.
Where Pith is reading between the lines
- Model developers could add explicit speech-input training objectives to narrow the gap observed in most tasks.
- The dataset enables future work on prompt engineering techniques that compensate for spoken input weaknesses.
- Real deployments might route spoken instructions through a text transcription step for non-speech tasks to retain higher accuracy.
- Extending the dataset with more spontaneous, noisy recordings would test whether the current gap widens or narrows under messier conditions.
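The transcription-routing idea in the list above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not something the paper evaluates: `route_instruction` and the `transcribe` callable are hypothetical names, and the only task labels taken from the source are TTS and S2ST, the two speech-output tasks it mentions.

```python
from typing import Callable, Tuple, Union

# Speech-output tasks named in the source; any other label is treated as
# a non-speech-output task in this sketch.
SPEECH_OUTPUT_TASKS = {"TTS", "S2ST"}


def route_instruction(
    task: str,
    audio_prompt: bytes,
    transcribe: Callable[[bytes], str],
) -> Tuple[str, Union[bytes, str]]:
    """Choose the modality in which the instruction reaches the model.

    Speech-output tasks keep the spoken prompt unchanged; every other task
    is routed through the caller-supplied (hypothetical) transcriber so the
    model receives a text instruction.
    """
    if task in SPEECH_OUTPUT_TASKS:
        return ("speech", audio_prompt)
    return ("text", transcribe(audio_prompt))


# Usage with a stub transcriber standing in for a real ASR system.
modality, prompt = route_instruction("TSUM", b"\x00\x01", lambda audio: "Summarize the talk.")
```

Any benefit depends on the transcriber: recognition errors in the instruction would pass straight into the text prompt, so the routed pipeline would need its own evaluation.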
Load-bearing premise
The recorded spoken prompts and selected task-language-style combinations reflect the speech patterns and variations that real users would produce when interacting with these models.
What would settle it
A study in which actual users give spoken instructions to SLLMs on non-speech tasks and achieve performance equal to or better than text prompts would show the outperformance claim does not hold.
Original abstract
Speech Large Language Models (SLLMs) have rapidly expanded, supporting a wide range of tasks. These models are typically evaluated using text prompts, which may not reflect real-world scenarios where users interact with speech. To address this gap, we introduce DoWhatISay (DOWIS), a multilingual dataset of human-recorded spoken and written prompts designed to pair with any existing benchmark for realistic evaluation of SLLMs under spoken instruction conditions. Spanning 9 tasks and 11 languages, it provides 10 prompt variants per task-language pair, across five styles. Using DOWIS, we benchmark state-of-the-art SLLMs, analyzing the interplay between prompt modality, style, language, and task type. Results show that text prompts consistently outperform spoken prompts, particularly for low-resource and cross-lingual settings. Only for tasks with speech output do spoken prompts close the gap, highlighting the need for speech-based prompting in SLLM evaluation.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces DoWhatISay (DOWIS), a multilingual dataset of human-recorded spoken and written prompts designed to pair with existing benchmarks for evaluating Speech Large Language Models (SLLMs). The dataset spans 9 tasks and 11 languages, with 10 prompt variants per task-language pair across five styles. The work benchmarks state-of-the-art SLLMs and reports that text prompts consistently outperform spoken prompts (especially in low-resource and cross-lingual settings), with spoken prompts closing the performance gap only for tasks involving speech output.
Significance. If the dataset proves representative, this provides a useful new resource for realistic SLLM evaluation and demonstrates concrete modality effects that could guide future model development and prompting strategies in multilingual and speech-centric scenarios. The empirical nature of the comparisons (new runs on existing models) adds direct evidence without circularity.
major comments (3)
- [Dataset construction] Dataset construction section: the paper provides no details on speaker pool size, demographics, accent diversity, recording environment (e.g., noise levels, equipment), quality assurance protocols, or any validation against in-the-wild spoken interactions. This directly undermines the central claim that observed text-vs-spoken gaps reflect intrinsic SLLM properties rather than artifacts of the controlled recording protocol.
- [Results and analysis] Results and analysis section: exact evaluation metrics (e.g., accuracy, F1, or task-specific scores) are not specified, nor are statistical significance tests or confidence intervals reported for the modality differences. Without these, the strength of the finding that text outperforms spoken (particularly low-resource/cross-lingual) cannot be fully assessed.
- [Task and language selection] Task and language selection: it is unclear how the specific 9 tasks and 11 languages were chosen or whether they include sufficient low-resource examples to support the generalization that text prompts are superior in those regimes; this choice is load-bearing for the cross-lingual claims.
minor comments (2)
- [Abstract] The abstract states 'five styles' without naming them; this should be defined in the abstract or introduction for immediate clarity.
- [Figures and tables] Figure captions and table headers could more explicitly indicate which columns correspond to spoken vs. text prompt conditions to aid quick reading.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. We address each major comment point by point below and will revise the paper to provide the requested clarifications and details.
- Referee: [Dataset construction] Dataset construction section: the paper provides no details on speaker pool size, demographics, accent diversity, recording environment (e.g., noise levels, equipment), quality assurance protocols, or any validation against in-the-wild spoken interactions. This directly undermines the central claim that observed text-vs-spoken gaps reflect intrinsic SLLM properties rather than artifacts of the controlled recording protocol.
  Authors: We agree that additional details on dataset construction are needed. In the revised manuscript, we will expand the relevant section to include speaker pool size and demographics, accent diversity, recording environment specifications (including equipment and noise levels), quality assurance protocols, and a discussion of limitations relative to in-the-wild interactions. This will better support interpretation of the modality gaps while noting the controlled nature of the recordings for consistent comparison.
  Revision: yes
- Referee: [Results and analysis] Results and analysis section: exact evaluation metrics (e.g., accuracy, F1, or task-specific scores) are not specified, nor are statistical significance tests or confidence intervals reported for the modality differences. Without these, the strength of the finding that text outperforms spoken (particularly low-resource/cross-lingual) cannot be fully assessed.
  Authors: We will revise the Results and Analysis section to explicitly define the evaluation metrics for each task and include statistical significance tests with confidence intervals for the reported modality differences. This will allow for a more rigorous assessment of the findings.
  Revision: yes
- Referee: [Task and language selection] Task and language selection: it is unclear how the specific 9 tasks and 11 languages were chosen or whether they include sufficient low-resource examples to support the generalization that text prompts are superior in those regimes; this choice is load-bearing for the cross-lingual claims.
  Authors: We will add a subsection detailing the selection criteria for the 9 tasks and 11 languages, including justification for coverage of low-resource languages to support the cross-lingual claims. This will make the rationale transparent.
  Revision: yes
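One way the confidence intervals requested in the second comment could be obtained is a paired percentile bootstrap over prompt-level score differences. The Python sketch below is an assumption about how such an analysis might look, not the authors' procedure; the function name, the default of 2000 resamples, and the toy scores in the usage lines are all illustrative.

```python
import random


def paired_bootstrap_ci(text_scores, speech_scores, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean of paired (text - speech) differences.

    Both lists must be aligned so that index i refers to the same prompt
    evaluated in the two modalities. Returns (mean_diff, lower, upper).
    """
    if len(text_scores) != len(speech_scores) or not text_scores:
        raise ValueError("score lists must be non-empty and of equal length")

    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    diffs = [t - s for t, s in zip(text_scores, speech_scores)]
    n = len(diffs)

    resampled_means = []
    for _ in range(n_resamples):
        # Resample the paired differences with replacement.
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        resampled_means.append(sum(sample) / n)
    resampled_means.sort()

    lower = resampled_means[int((alpha / 2) * n_resamples)]
    upper = resampled_means[int((1 - alpha / 2) * n_resamples) - 1]
    return sum(diffs) / n, lower, upper


# Toy, made-up scores for four prompts in each modality.
mean_diff, lower, upper = paired_bootstrap_ci([0.9, 0.8, 0.85, 0.7], [0.8, 0.75, 0.8, 0.6])
```

An interval that excludes zero would indicate a modality difference that is unlikely to be explained by the choice of prompts alone. With only 10 prompt variants per task-language pair, such intervals would be wide, so pooling across pairs or resampling test items as well may be needed.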
Circularity Check
No significant circularity in empirical dataset evaluation
The paper introduces the DOWIS dataset of spoken/written prompts and reports direct benchmarking results on existing SLLMs. All claims (text prompts outperforming spoken ones except on speech-output tasks) follow from new empirical runs without equations, fitted parameters renamed as predictions, or self-citation chains that reduce the central findings to prior inputs by construction. The evaluation is self-contained against external model benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Human-recorded spoken prompts accurately reflect natural user speech patterns and intent.
Reference graph
Works this paper leans on
- [1] Introduction. Speech Large Language Models (SLLMs) have seen remarkable progress in recent years, demonstrating strong performance across both speech and text tasks [1, 2]. A key capability of these models is instruction-following (IF): rather than requiring a special tag or argument to specify a task, they can be guided through natural language prompts ...
- [2] and Uro-Bench [17] are among the few benchmarks with spoken instructions, but both have notable limitations. First, their instructions are generated using text-to-speech systems, cover only English and Chinese, and are pre-concatenated with task-specific inputs, making them impossible to reuse with other datasets. Second, they focus on general instruction...
- [3] Do What I Say: A Spoken Prompt Dataset for Instruction-Following
  and Qwen2.5-Omni [21], using DOWIS. We find that for tasks with text output, text prompts significantly overestimate the models’ performance in contrast to spoken prompts, while for tasks with speech output, such as text-to-speech synthesis or speech-to-speech translation, spoken prompts perform on par or better. Regarding prompt style, informal text an...
- [4] Translate what was said in this presentation
  The DOWIS Prompt Dataset. We introduce DOWIS, the first multilingual speech prompt dataset comprising parallel spoken and textual instructions for nine speech and language processing tasks across 11 languages. DOWIS can be combined with any existing downstream task benchmark and is therefore designed to facilitate multifaceted evaluation of instruction-fol...
- [5] Experiments. Models. We select Qwen2.5-Omni-7B (footnote 2) [21] and Phi-4-multimodal-instruct (footnote 3) [20], two state-of-the-art models (hereafter Qwen and Phi, respectively), to analyze the impact of different prompt types and modalities. Both models are run with default inference parameters and batch size 1 on a single NVIDIA A100-SXM4-40GB GPU. For each task, we evaluate fi...
- [6] with the deberta-xlarge-mnli (footnote 4) [29] model to measure semantic similarity between generated and reference answers. For speech-output tasks (TTS and S2ST), we first transcribe the generated audio using whisper-large-v3 (footnote 5) [30]. We then report WER for TTS and CometKiwi for S2ST to evaluate content accuracy. To assess speech quality, we additionally repo...
- [7] Analysis. We analyse models’ performance on the DOWIS prompts along two dimensions: (1) how well state-of-the-art SLLMs perform on speech vs. text prompts (Section 4.1), and (2) how much prompt types influence their performance (Section 4.2). 4.1. Impact of Text vs. Speech Prompts. General Trends. Tab... (Footnotes: 4, microsoft/deberta-xlarge-mnli; 5, openai/whisper-large-v3.)
- [8] and compute WER against the reference prompt texts. With generally high intelligibility across all prompts, we find no clear pattern between prompt WER and model performance. For example, TSUM prompts have similarly low transcription WER for both genders (12% for both), yet a performance gap persists (BERTScore 43.88 vs. 42.93). This suggests that prompt ...
- [9] Conclusion. We introduced DOWIS, the first human-recorded parallel spoken-textual prompt dataset for evaluating spoken instruction-following in SLLMs. Covering nine tasks, 11 languages, and five prompt styles, DOWIS can be easily combined with any task-specific benchmark to enable more realistic and comprehensive instruction-following SLLM evaluation....
- [10] Generative AI Use Disclosure. Claude was employed exclusively to correct grammar in content authored by humans and to write code for the paper’s plots.
- [11] Acknowledgements. This work has received funding from the European Union’s Horizon research and innovation programme under grant agreement No 101135798, project Meetween (My Personal AI Mediator for Virtual MEETtings BetWEEN People). This research is also supported by the project “How is AI Changing Science? Research in the Era of Learning Algorith...
- [12]
- [13] S. Arora et al., “On the landscape of spoken language models: A comprehensive survey,” 2025
- [14] T. Brown et al., “Language models are few-shot learners,” in Advances in Neural Information Processing Systems, H. Larochelle, M. Ranzato, R. Hadsell, M. Balcan, and H. Lin, Eds., vol. 33. Curran Associates, Inc., 2020, pp. 1877–1901
- [15] Y. Su et al., “PandaGPT: One model to instruction-follow them all,” in Proceedings of the 1st Workshop on Taming Large Language Models: Controllability in the era of Interactive Assistants!, D. Hazarika, X. R. Tang, and D. Jin, Eds. Prague, Czech Republic: Association for Computational Linguistics, Sep. 2023, pp. 11–23
- [16] C. Tang et al., “SALMONN: Towards generic hearing abilities for large language models,” in ICLR, 2024
- [17] K.-H. Lu et al., “Speech-IFEval: Evaluating Instruction-Following and Quantifying Catastrophic Forgetting in Speech-Aware Language Models,” in Interspeech 2025, 2025, pp. 2078–2082
- [18] Q. Yang et al., “AIR-bench: Benchmarking large audio-language models via generative comprehension,” in Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), L.-W. Ku, A. Martins, and V. Srikumar, Eds. Bangkok, Thailand: Association for Computational Linguistics, Aug. 2024, pp. 1979–1998
- [19] D. Wang et al., “Mmsu: A massive multi-task spoken language understanding and reasoning benchmark,” 2026
- [20] K. Gao et al., “Benchmarking open-ended audio dialogue understanding for large audio-language models,” in Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), W. Che, J. Nabende, E. Shutova, and M. T. Pilehvar, Eds. Vienna, Austria: Association for Computational Linguistics, Jul. 2025, pp. 4763–4784
- [21] C. yu Huang et al., “Dynamic-SUPERB phase-2: A collaboratively expanding benchmark for measuring the capabilities of spoken language models with 180 tasks,” in ICLR, 2025
- [22] P. Pandey et al., “SIFT-50M: A large-scale multilingual dataset for speech instruction fine-tuning,” in Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), W. Che, J. Nabende, E. Shutova, and M. T. Pilehvar, Eds. Vienna, Austria: Association for Computational Linguistics, Jul. 2025, pp. 13 921–13 942
- [23] S. Papi et al., “MCIF: Multimodal crosslingual instruction-following benchmark from scientific talks,” in ICLR, 2026
- [24] Y. Chen et al., “Voicebench: Benchmarking llm-based voice assistants,” 2024
- [25] K. Wang et al., “Voiceassistant-eval: Benchmarking ai assistants across listening, speaking, and viewing,” 2025
- [26] C.-K. Yang et al., “SAKURA: On the Multi-hop Reasoning of Large Audio-Language Models Based on Speech and Audio Information,” in Interspeech 2025, 2025, pp. 1788–1792
- [27] D. Wang et al., “InSerter: Speech instruction following with unsupervised interleaved pre-training,” in Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), W. Che, J. Nabende, E. Shutova, and M. T. Pilehvar, Eds. Vienna, Austria: Association for Computational Linguistics, Jul. 2025, pp. 18 024–18 046
- [28] R. Yan et al., “URO-bench: Towards comprehensive evaluation for end-to-end spoken dialogue models,” in Findings of the Association for Computational Linguistics: EMNLP 2025, C. Christodoulopoulos, T. Chakraborty, C. Rose, and V. Peng, Eds. Suzhou, China: Association for Computational Linguistics, Nov. 2025, pp. 17 211–17 242
- [29] F. Alam et al., “Spokennativqa: Multilingual everyday spoken queries for llms,” 2025
- [30] F. Retkowski et al., “Summarizing speech: A comprehensive survey,” in Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, C. Christodoulopoulos, T. Chakraborty, C. Rose, and V. Peng, Eds. Suzhou, China: Association for Computational Linguistics, Nov. 2025, pp. 27 275–27 306
- [31] Microsoft, “Phi-4-mini technical report: Compact yet powerful multimodal language models via mixture-of-loras,” 2025
- [32]
- [33] A. Conneau et al., “Fleurs: Few-shot learning evaluation of universal representations of speech,” in IEEE SLT, 2023
- [34] F. Retkowski et al., “From Text Segmentation to Smart Chaptering: A Novel Benchmark for Structuring Video Transcriptions,” in EACL, 2024
- [35] A. C. Morris et al., “From WER and RIL to MER and WIL: improved evaluation measures for connected speech recognition,” in Interspeech, 2004
- [36] R. Rei et al., “CometKiwi: IST-unbabel 2022 submission for the quality estimation shared task,” in Proc. WMT, 2022
- [37] M. Freitag et al., “Are LLMs breaking MT metrics? results of the WMT24 metrics shared task,” in Proceedings of the Ninth Conference on Machine Translation, B. Haddow, T. Kocmi, P. Koehn, and C. Monz, Eds. Miami, Florida, USA: Association for Computational Linguistics, Nov. 2024, pp. 47–81
- [38] S. Papi et al., “Hearing to Translate: The Effectiveness of Speech Modality Integration into LLMs,” 2025
- [39] T. Zhang et al., “Bertscore: Evaluating text generation with BERT,” in ICLR, 2020
- [40] P. He et al., “Deberta: Decoding-enhanced bert with disentangled attention,” in ICLR, 2021
- [41] A. Radford et al., “Robust speech recognition via large-scale weak supervision,” in Proc. ICML, 2023
- [42] T. Saeki et al., “UTMOS: UTokyo-SaruLab System for VoiceMOS Challenge 2022,” in Interspeech, 2022
- [43] F. Retkowski et al., “Beyond transcripts: A renewed perspective on audio chaptering,” 2026
- [44] G. Attanasio et al., “Twists, humps, and pebbles: Multilingual speech recognition models exhibit gender performance gaps,” in Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, Miami, Florida, USA, Nov. 2024, pp. 21 318–21 340