
arxiv: 2604.26500 · v1 · submitted 2026-04-29 · 💻 cs.CL

StarDrinks: An English and Korean Test Set for SLU Evaluation in a Drink Ordering Scenario

Pith reviewed 2026-05-07 11:22 UTC · model grok-4.3

classification 💻 cs.CL
keywords spoken language understanding · test set · drink ordering · English · Korean · slot filling · multilingual evaluation · spontaneous speech

The pith

A new English and Korean test set benchmarks speech-to-slots understanding for drink ordering with real variability.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces StarDrinks to address the shortcoming that current evaluations of speech assistants and LLMs rely on overly controlled scenarios. Drink ordering requires handling diverse named entities, sizes, customizations, brands, and spontaneous speech events such as hesitations and self-corrections. The dataset supplies speech features, transcriptions, and slot annotations in two languages, enabling three distinct evaluation tracks: speech-to-slots SLU, transcription-to-slots NLU, and speech-to-transcription ASR. A reader would care because the set offers a concrete way to measure whether models actually generalize to the messy requests users make in practice.

Core claim

We introduce StarDrinks, a test set in English and Korean containing speech utterances features, transcriptions, and annotated slots. Our dataset supports speech-to-slots SLU, transcription-to-slots NLU, and speech-to-transcription ASR evaluation, providing a realistic benchmark for model robustness and generalization in a linguistically rich, real-world task.

What carries the argument

The StarDrinks dataset of paired speech recordings, transcriptions, and slot annotations that supports three evaluation modes in a single drink-ordering domain.
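
To make the three evaluation modes concrete, here is a minimal sketch of how one annotated utterance could feed all three tracks. The field names, the slot labels, and the list-of-floats stand-in for speech features are assumptions made for illustration; the actual release format is not described here.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple, Union


@dataclass
class Utterance:
    # All field names are hypothetical; the released format may differ.
    language: str                 # "en" or "ko"
    speech_features: List[float]  # stand-in for the released speech features
    transcription: str            # manually corrected transcript
    slots: Dict[str, str]         # slot name -> value (labels are illustrative)


# Each track pairs a model input field with a reference output field.
TRACKS: Dict[str, Tuple[str, str]] = {
    "slu": ("speech_features", "slots"),          # speech-to-slots
    "nlu": ("transcription", "slots"),            # transcription-to-slots
    "asr": ("speech_features", "transcription"),  # speech-to-transcription
}


def make_example(utt: Utterance, track: str) -> Tuple[
    Union[List[float], str], Union[Dict[str, str], str]
]:
    """Return the (input, reference) pair that the given track evaluates."""
    source_field, target_field = TRACKS[track]
    return getattr(utt, source_field), getattr(utt, target_field)


utt = Utterance(
    language="en",
    speech_features=[0.1, 0.2],
    transcription="uh a large iced latte with oat milk",
    slots={"size": "large", "drink": "iced latte", "milk": "oat milk"},
)

nlu_input, nlu_reference = make_example(utt, "nlu")
asr_input, asr_reference = make_example(utt, "asr")
```

Because every track reads from the same record, scores across SLU, NLU, and ASR refer to the same underlying utterances.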

If this is right

  • Models can be scored for robustness to named entities, customizations, and spontaneous speech within one consistent domain.
  • The same utterances allow direct comparison of end-to-end SLU against cascaded NLU and ASR pipelines.
  • Performance numbers on StarDrinks can serve as a reference point for measuring progress in multilingual task-oriented understanding.
  • Developers gain a practical resource for identifying generalization failures before deployment.
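
The comparison between end-to-end and cascaded systems can be sketched with a micro-averaged slot F1 over exact (slot, value) matches. This metric is a common choice for slot filling, not one confirmed by the paper; the slot names and both sets of system outputs below are invented for illustration.

```python
from typing import Dict, List, Tuple

SlotDict = Dict[str, str]


def slot_prf(references: List[SlotDict],
             predictions: List[SlotDict]) -> Tuple[float, float, float]:
    """Micro-averaged precision, recall, and F1 over exact (slot, value) pairs."""
    tp = fp = fn = 0
    for ref, hyp in zip(references, predictions):
        ref_pairs = set(ref.items())
        hyp_pairs = set(hyp.items())
        tp += len(ref_pairs & hyp_pairs)   # predicted and correct
        fp += len(hyp_pairs - ref_pairs)   # predicted but wrong or spurious
        fn += len(ref_pairs - hyp_pairs)   # reference pairs that were missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


# Invented reference annotations and system outputs for two utterances.
references = [{"size": "large", "drink": "latte"}, {"drink": "americano"}]
end_to_end = [{"size": "large", "drink": "latte"}, {"drink": "espresso"}]
cascaded = [{"size": "large"}, {"drink": "americano"}]

e2e_scores = slot_prf(references, end_to_end)
cascaded_scores = slot_prf(references, cascaded)
```

Scoring both systems against the same references is what makes the comparison direct: any gap reflects the systems, not a difference in test material.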

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same collection method could be replicated for other service domains such as food or travel booking to create comparable benchmarks.
  • Multilingual pairs like English-Korean may surface language-specific annotation or recognition challenges that single-language sets miss.
  • The slot annotations could be reused to generate synthetic training data that includes realistic hesitations and corrections.

Load-bearing premise

The collected utterances and annotations sufficiently capture the variability, complexity, and spontaneous phenomena of real user drink-ordering requests to serve as a generalizable benchmark.

What would settle it

A side-by-side comparison showing that real customer drink orders contain substantially more, or different, variability than the annotated examples in StarDrinks would indicate that the dataset is not representative.

Figures

Figures reproduced from arXiv: 2604.26500 by Caroline Brun, Denys Proux, Inyoung Kim, Ioan Calapodescu, Jean-Luc Meunier, Marcely Zanon Boito, Nikolaos Lagos, Salah Ait-Mokhtar.

Figure 1: StarDrinks data generation pipeline overview.
Figure 2: StarDrinks data examples from the English (top) and Korean (bottom) splits.
Figure 3: An example from our recording session for English.
Figure 4: An example of prompt for NLU (3-shots, English).
Abstract

LLMs and speech assistants are increasingly used for task-oriented interactions, yet their evaluation often relies on controlled scenarios that fail to capture the variability and complexity of real user requests. Drink ordering, for example, involves diverse named entities, drink types, sizes, customizations, and brand-specific terminology, as well as spontaneous speech phenomena such as hesitations and self-corrections. To address this gap, we introduce StarDrinks, a test set in English and Korean containing speech utterances features, transcriptions, and annotated slots. Our dataset supports speech-to-slots SLU, transcription-to-slots NLU, and speech-to-transcription ASR evaluation, providing a realistic benchmark for model robustness and generalization in a linguistically rich, real-world task.
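
For the speech-to-transcription track, a standard score is word error rate (WER): the word-level edit distance between reference and hypothesis, divided by the reference length. The sketch below assumes whitespace tokenization and WER as the metric. Neither is confirmed by the abstract, and character-level scoring is a common alternative for Korean.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between ref[:i-1] and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # distance from ref[:i] to an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing in hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# Invented example: one substituted word out of four.
wer = word_error_rate("a large iced latte", "a large ice latte")
```

A single misrecognized customization or brand term counts as one error here, which is why entity-heavy utterances can be scored harshly by WER even when the rest of the transcript is correct.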

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The paper introduces StarDrinks, a bilingual (English and Korean) test set for spoken language understanding evaluation in a drink-ordering scenario. It consists of speech utterances with associated features, transcriptions, and slot annotations, supporting three evaluation modes: speech-to-slots SLU, transcription-to-slots NLU, and speech-to-transcription ASR. The central claim is that the dataset supplies a realistic benchmark because its utterances contain diverse named entities, drink customizations, brand-specific terminology, and spontaneous phenomena such as hesitations and self-corrections, thereby addressing limitations of controlled evaluation scenarios for LLMs and speech assistants.

Significance. If the dataset were shown to derive from natural interactions with reliable annotations, it would constitute a useful addition to the limited set of realistic, domain-specific resources for task-oriented SLU, particularly in a bilingual setting with high entity and customization variability. The release of such a test set could help expose generalization failures that synthetic data miss. The paper's main strength is the provision of a new, publicly oriented resource in an under-served domain; however, this potential remains unrealized without documentation of its construction.

major comments (1)
  1. [Abstract] Abstract: The claim that StarDrinks 'provides a realistic benchmark for model robustness and generalization' is load-bearing for the paper's contribution, yet the manuscript supplies no information on collection methodology (e.g., prompted vs. natural speech), speaker count or demographics, disfluency rates, entity coverage statistics, annotation guidelines, or inter-annotator agreement. Without these, the central realism and generalizability assertions cannot be evaluated.
minor comments (1)
  1. [Abstract] Abstract: The phrase 'speech utterances features' is vague; specifying which acoustic or prosodic features are provided would improve clarity for readers interested in ASR or SLU pipelines.
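
As an illustration of the inter-annotator agreement the referee asks for, Cohen's kappa is one widely used statistic for two annotators labelling the same items. The sketch below is generic: the report does not say which statistic the authors would use, and the token labels are invented.

```python
from collections import Counter
from typing import List


def cohens_kappa(labels_a: List[str], labels_b: List[str]) -> float:
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)

    # Observed agreement: fraction of items given identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Chance agreement from each annotator's own label distribution.
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    expected = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)

    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1.0 - expected)


# Invented token-level slot labels from two hypothetical annotators.
annotator_a = ["drink", "size", "O", "O"]
annotator_b = ["drink", "O", "O", "O"]
kappa = cohens_kappa(annotator_a, annotator_b)
```

Kappa corrects raw agreement for chance, which matters when one label (such as "O" for tokens outside any slot) dominates the annotation.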

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their constructive review and for recognizing the potential utility of StarDrinks as a domain-specific resource for bilingual SLU evaluation. We address the major comment below and will incorporate the requested documentation into the revised manuscript.

  1. Referee: [Abstract] Abstract: The claim that StarDrinks 'provides a realistic benchmark for model robustness and generalization' is load-bearing for the paper's contribution, yet the manuscript supplies no information on collection methodology (e.g., prompted vs. natural speech), speaker count or demographics, disfluency rates, entity coverage statistics, annotation guidelines, or inter-annotator agreement. Without these, the central realism and generalizability assertions cannot be evaluated.

    Authors: We agree that the manuscript would be strengthened by explicit details on dataset construction to substantiate the realism claims. The submitted version does not provide these specifics. In the revision we will add a dedicated 'Dataset Construction' section describing the collection methodology, speaker count and demographics, disfluency rates, entity coverage statistics, annotation guidelines, and inter-annotator agreement. This will allow readers to evaluate the benchmark's properties directly.

    Revision: yes

Circularity Check

0 steps flagged

No circularity in dataset introduction paper


The paper is a direct resource contribution that introduces the StarDrinks test set with speech utterances, transcriptions, and slot annotations for English and Korean drink-ordering scenarios. It contains no derivations, equations, fitted parameters, predictions, or self-citations that function as load-bearing premises. The central claim—that the dataset provides a realistic benchmark for SLU/NLU/ASR evaluation—is presented as following immediately from the described collection and annotation process, without any reduction to self-definition, renamed empirical patterns, or imported uniqueness results. No step in the manuscript reduces by construction to its own inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is a dataset release paper with no mathematical model, free parameters, axioms, or invented entities; the contribution rests entirely on the creation and annotation of the resource itself.

pith-pipeline@v0.9.0 · 5455 in / 1084 out tokens · 50495 ms · 2026-05-07T11:22:14.495413+00:00 · methodology

