StarDrinks: An English and Korean Test Set for SLU Evaluation in a Drink Ordering Scenario
Pith reviewed 2026-05-07 11:22 UTC · model grok-4.3
The pith
A new English and Korean test set benchmarks speech-to-slots understanding for drink ordering with real variability.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We introduce StarDrinks, a test set in English and Korean containing speech utterances features, transcriptions, and annotated slots. Our dataset supports speech-to-slots SLU, transcription-to-slots NLU, and speech-to-transcription ASR evaluation, providing a realistic benchmark for model robustness and generalization in a linguistically rich, real-world task.
What carries the argument
The StarDrinks dataset of paired speech recordings, transcriptions, and slot annotations that supports three evaluation modes in a single drink-ordering domain.
If this is right
- Models can be scored for robustness to named entities, customizations, and spontaneous speech within one consistent domain.
- The same utterances allow direct comparison of end-to-end SLU against cascaded NLU and ASR pipelines.
- Performance numbers on StarDrinks can serve as a reference point for measuring progress in multilingual task-oriented understanding.
- Developers gain a practical resource for identifying generalization failures before deployment.
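The comparison between end-to-end SLU and a cascaded ASR-plus-NLU pipeline can be made concrete with two standard metrics: word error rate for the transcription step and slot-level F1 for the slot output. The sketch below is illustrative only. The slot names (`size`, `drink`, `milk`), the example utterance, and the function names are hypothetical and are not taken from the paper, which does not state its slot schema or its scoring protocol here.

```python
from collections import Counter


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    prev = list(range(len(hyp) + 1))  # distances for the empty reference prefix
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / max(len(ref), 1)


def slot_f1(gold, pred) -> float:
    """Micro F1 over (slot_name, value) pairs with exact matching."""
    if not gold and not pred:
        return 1.0
    gold_counts, pred_counts = Counter(gold), Counter(pred)
    true_pos = sum((gold_counts & pred_counts).values())
    if true_pos == 0:
        return 0.0
    precision = true_pos / sum(pred_counts.values())
    recall = true_pos / sum(gold_counts.values())
    return 2 * precision * recall / (precision + recall)


# Hypothetical example: one utterance scored under both setups.
reference = "a large iced latte with oat milk"
asr_output = "a large iced latte with old milk"  # assumed ASR error
gold_slots = [("size", "large"), ("drink", "iced latte"), ("milk", "oat")]
end_to_end_slots = [("size", "large"), ("drink", "iced latte"), ("milk", "oat")]
cascaded_slots = [("size", "large"), ("drink", "iced latte"), ("milk", "old")]

wer = word_error_rate(reference, asr_output)
f1_end_to_end = slot_f1(gold_slots, end_to_end_slots)
f1_cascaded = slot_f1(gold_slots, cascaded_slots)
```

In this toy case a single ASR substitution propagates into one wrong slot value for the cascaded pipeline, which is the kind of error a shared set of utterances makes visible. Exact-match scoring is an assumption; a real evaluation might normalize values before comparison.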
Where Pith is reading between the lines
- The same collection method could be replicated for other service domains such as food or travel booking to create comparable benchmarks.
- Multilingual pairs like English-Korean may surface language-specific annotation or recognition challenges that single-language sets miss.
- The slot annotations could be reused to generate synthetic training data that includes realistic hesitations and corrections.
Load-bearing premise
The collected utterances and annotations sufficiently capture the variability, complexity, and spontaneous phenomena of real user drink-ordering requests to serve as a generalizable benchmark.
What would settle it
A side-by-side comparison in which real customer drink orders contain substantially more or different variability than the annotated examples in StarDrinks would show the dataset is not representative.
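One rough way to run such a side-by-side comparison is to compute the same surface statistics over benchmark transcriptions and over a sample of real orders. The sketch below is a minimal illustration under stated assumptions: the filler list, the two statistics, and the example transcripts are all hypothetical and are not drawn from StarDrinks or from the paper.

```python
# Assumed English filler words; a Korean list would be needed separately.
FILLERS = {"um", "uh", "er", "hmm"}


def variability_profile(transcripts):
    """Return filler rate and type-token ratio for a list of transcripts."""
    tokens = [
        word
        for line in transcripts
        for word in (raw.strip(",.?!") for raw in line.lower().split())
        if word  # drop tokens that were punctuation only
    ]
    if not tokens:
        return {"filler_rate": 0.0, "type_token_ratio": 0.0}
    filler_count = sum(1 for word in tokens if word in FILLERS)
    return {
        "filler_rate": filler_count / len(tokens),
        "type_token_ratio": len(set(tokens)) / len(tokens),
    }


# Made-up transcripts standing in for the two sources being compared.
benchmark_sample = ["a large iced latte please"]
field_sample = ["um a large uh iced latte please"]

benchmark_profile = variability_profile(benchmark_sample)
field_profile = variability_profile(field_sample)
```

A large gap between the two profiles on real data would point toward the benchmark under-representing spontaneous speech. These two statistics are only a starting point; entity coverage and self-correction rates would need annotation-aware measures.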
Original abstract
LLMs and speech assistants are increasingly used for task-oriented interactions, yet their evaluation often relies on controlled scenarios that fail to capture the variability and complexity of real user requests. Drink ordering, for example, involves diverse named entities, drink types, sizes, customizations, and brand-specific terminology, as well as spontaneous speech phenomena such as hesitations and self-corrections. To address this gap, we introduce StarDrinks, a test set in English and Korean containing speech utterances features, transcriptions, and annotated slots. Our dataset supports speech-to-slots SLU, transcription-to-slots NLU, and speech-to-transcription ASR evaluation, providing a realistic benchmark for model robustness and generalization in a linguistically rich, real-world task.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces StarDrinks, a bilingual (English and Korean) test set for spoken language understanding evaluation in a drink-ordering scenario. It consists of speech utterances with associated features, transcriptions, and slot annotations, supporting three evaluation modes: speech-to-slots SLU, transcription-to-slots NLU, and speech-to-transcription ASR. The central claim is that the dataset supplies a realistic benchmark because its utterances contain diverse named entities, drink customizations, brand-specific terminology, and spontaneous phenomena such as hesitations and self-corrections, thereby addressing limitations of controlled evaluation scenarios for LLMs and speech assistants.
Significance. If the dataset were shown to derive from natural interactions with reliable annotations, it would constitute a useful addition to the limited set of realistic, domain-specific resources for task-oriented SLU, particularly in a bilingual setting with high entity and customization variability. The release of such a test set could help expose generalization failures that synthetic data miss. The paper's main strength is the provision of a new, publicly oriented resource in an under-served domain; however, this potential remains unrealized without documentation of its construction.
major comments (1)
- [Abstract] The claim that StarDrinks provides 'a realistic benchmark for model robustness and generalization' is load-bearing for the paper's contribution, yet the manuscript supplies no information on collection methodology (e.g., prompted vs. natural speech), speaker count or demographics, disfluency rates, entity coverage statistics, annotation guidelines, or inter-annotator agreement. Without these, the central realism and generalizability assertions cannot be evaluated.
minor comments (1)
- [Abstract] The phrase 'speech utterances features' is vague; specifying which acoustic or prosodic features are provided would improve clarity for readers interested in ASR or SLU pipelines.
Simulated Author's Rebuttal
We thank the referee for their constructive review and for recognizing the potential utility of StarDrinks as a domain-specific resource for bilingual SLU evaluation. We address the major comment below and will incorporate the requested documentation into the revised manuscript.
- Referee: [Abstract] The claim that StarDrinks provides 'a realistic benchmark for model robustness and generalization' is load-bearing for the paper's contribution, yet the manuscript supplies no information on collection methodology (e.g., prompted vs. natural speech), speaker count or demographics, disfluency rates, entity coverage statistics, annotation guidelines, or inter-annotator agreement. Without these, the central realism and generalizability assertions cannot be evaluated.
  Authors: We agree that the manuscript would be strengthened by explicit details on dataset construction to substantiate the realism claims. The submitted version does not provide these specifics. In the revision we will add a dedicated 'Dataset Construction' section describing the collection methodology, speaker count and demographics, disfluency rates, entity coverage statistics, annotation guidelines, and inter-annotator agreement. This will allow readers to evaluate the benchmark's properties directly.
  Revision: yes
Circularity Check
No circularity in dataset introduction paper
The paper is a direct resource contribution that introduces the StarDrinks test set with speech utterances, transcriptions, and slot annotations for English and Korean drink-ordering scenarios. It contains no derivations, equations, fitted parameters, predictions, or self-citations that function as load-bearing premises. The central claim—that the dataset provides a realistic benchmark for SLU/NLU/ASR evaluation—is presented as following immediately from the described collection and annotation process, without any reduction to self-definition, renamed empirical patterns, or imported uniqueness results. No step in the manuscript reduces by construction to its own inputs.