StarDrinks: An English and Korean Test Set for SLU Evaluation in a Drink Ordering Scenario

Caroline Brun; Denys Proux; Inyoung Kim; Ioan Calapodescu; Jean-Luc Meunier; Marcely Zanon Boito; Nikolaos Lagos; Salah Ait-Mokhtar

arxiv: 2604.26500 · v1 · submitted 2026-04-29 · 💻 cs.CL

StarDrinks: An English and Korean Test Set for SLU Evaluation in a Drink Ordering Scenario

Marcely Zanon Boito , Caroline Brun , Inyoung Kim , Denys Proux , Salah Ait-Mokhtar , Nikolaos Lagos , Jean-Luc Meunier , Ioan Calapodescu This is my paper

Pith reviewed 2026-05-07 11:22 UTC · model grok-4.3

classification 💻 cs.CL

keywords spoken language understandingtest setdrink orderingEnglishKoreanslot fillingmultilingual evaluationspontaneous speech

0 comments

The pith

A new English and Korean test set benchmarks speech-to-slots understanding for drink ordering with real variability.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces StarDrinks to address the shortcoming that current evaluations of speech assistants and LLMs rely on overly controlled scenarios. Drink ordering requires handling diverse named entities, sizes, customizations, brands, and spontaneous speech events such as hesitations and self-corrections. The dataset supplies speech features, transcriptions, and slot annotations in two languages, enabling three distinct evaluation tracks: speech-to-slots SLU, transcription-to-slots NLU, and speech-to-transcription ASR. A reader would care because the set offers a concrete way to measure whether models actually generalize to the messy requests users make in practice.

Core claim

We introduce StarDrinks, a test set in English and Korean containing speech utterances features, transcriptions, and annotated slots. Our dataset supports speech-to-slots SLU, transcription-to-slots NLU, and speech-to-transcription ASR evaluation, providing a realistic benchmark for model robustness and generalization in a linguistically rich, real-world task.

What carries the argument

The StarDrinks dataset of paired speech recordings, transcriptions, and slot annotations that supports three evaluation modes in a single drink-ordering domain.

If this is right

Models can be scored for robustness to named entities, customizations, and spontaneous speech within one consistent domain.
The same utterances allow direct comparison of end-to-end SLU against cascaded NLU and ASR pipelines.
Performance numbers on StarDrinks can serve as a reference point for measuring progress in multilingual task-oriented understanding.
Developers gain a practical resource for identifying generalization failures before deployment.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same collection method could be replicated for other service domains such as food or travel booking to create comparable benchmarks.
Multilingual pairs like English-Korean may surface language-specific annotation or recognition challenges that single-language sets miss.
The slot annotations could be reused to generate synthetic training data that includes realistic hesitations and corrections.

Load-bearing premise

The collected utterances and annotations sufficiently capture the variability, complexity, and spontaneous phenomena of real user drink-ordering requests to serve as a generalizable benchmark.

What would settle it

A side-by-side comparison in which real customer drink orders contain substantially more or different variability than the annotated examples in StarDrinks would show the dataset is not representative.

Figures

Figures reproduced from arXiv: 2604.26500 by Caroline Brun, Denys Proux, Inyoung Kim, Ioan Calapodescu, Jean-Luc Meunier, Marcely Zanon Boito, Nikolaos Lagos, Salah Ait-Mokhtar.

**Figure 1.** Figure 1: StarDrinks data generation pipeline overview. 2. Related Works SLU and NLU datasets typically consist of speech and text data, respectively, accompanied by annotations for intent classification and/or slot filling tasks. Moreover, datasets can be either single or multi-turn, meaning that the task is accomplished in either a single or through multiple turns of interaction, respectively. In this paper, we fo… view at source ↗

**Figure 2.** Figure 2: StarDrinks data examples from the English (top) and Korean (bottom) splits. set of receipts and asking them to record utterances ordering the corresponding items (Section 3.2). From the collected speech, we generated transcripts using a state-of-the-art ASR model, followed by manual correction of both transcripts and slot annotations to produce the final output (Section 3.3). The final StarDrinks SLU test… view at source ↗

**Figure 3.** Figure 3: An example from our recording session for English. view at source ↗

**Figure 4.** Figure 4: An example of prompt for NLU (3-shots, English). view at source ↗

read the original abstract

LLMs and speech assistants are increasingly used for task-oriented interactions, yet their evaluation often relies on controlled scenarios that fail to capture the variability and complexity of real user requests. Drink ordering, for example, involves diverse named entities, drink types, sizes, customizations, and brand-specific terminology, as well as spontaneous speech phenomena such as hesitations and self-corrections. To address this gap, we introduce StarDrinks, a test set in English and Korean containing speech utterances features, transcriptions, and annotated slots. Our dataset supports speech-to-slots SLU, transcription-to-slots NLU, and speech-to-transcription ASR evaluation, providing a realistic benchmark for model robustness and generalization in a linguistically rich, real-world task.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

StarDrinks is a new bilingual test set for drink-ordering SLU that covers multiple evaluation modes, but its realism claim lacks any supporting details on collection or annotation.

read the letter

The main point here is a new test set called StarDrinks for evaluating SLU in English and Korean drink-ordering interactions. It supplies speech utterances with features, transcriptions, and slot annotations, and it supports three separate evaluation tracks: speech-to-slots, text-to-slots, and speech-to-transcription ASR. That setup is practical for checking model behavior across pipelines in one narrow domain. The bilingual angle and the focus on entity-rich customizations plus disfluencies are the clearest additions over generic SLU benchmarks. The paper frames the motivation cleanly by pointing out how most current test sets stay too scripted. Those elements give the work a modest but concrete use case for people tuning systems in food-service or similar task-oriented settings. The central weakness is that the abstract asserts realism and generalization without any numbers or methods to back it up. There is no report on speaker count, collection procedure, whether speech was natural or prompted, inter-annotator agreement on slots, total size, or coverage of spontaneous phenomena. Without those, the benchmark claim stays unverified and the paper reads like a standard resource release rather than a validated one. This is the sort of paper that might interest SLU researchers who need domain-specific or Korean-language test data for quick robustness checks. It does not shift the broader field. I would send it to peer review if the full version adds the missing collection and validation details; as it stands in the abstract, the evidence is too thin to judge the main selling point.

Referee Report

1 major / 1 minor

Summary. The paper introduces StarDrinks, a bilingual (English and Korean) test set for spoken language understanding evaluation in a drink-ordering scenario. It consists of speech utterances with associated features, transcriptions, and slot annotations, supporting three evaluation modes: speech-to-slots SLU, transcription-to-slots NLU, and speech-to-transcription ASR. The central claim is that the dataset supplies a realistic benchmark because its utterances contain diverse named entities, drink customizations, brand-specific terminology, and spontaneous phenomena such as hesitations and self-corrections, thereby addressing limitations of controlled evaluation scenarios for LLMs and speech assistants.

Significance. If the dataset were shown to derive from natural interactions with reliable annotations, it would constitute a useful addition to the limited set of realistic, domain-specific resources for task-oriented SLU, particularly in a bilingual setting with high entity and customization variability. The release of such a test set could help expose generalization failures that synthetic data miss. The paper's main strength is the provision of a new, publicly oriented resource in an under-served domain; however, this potential remains unrealized without documentation of its construction.

major comments (1)

[Abstract] Abstract: The claim that StarDrinks 'provides a realistic benchmark for model robustness and generalization' is load-bearing for the paper's contribution, yet the manuscript supplies no information on collection methodology (e.g., prompted vs. natural speech), speaker count or demographics, disfluency rates, entity coverage statistics, annotation guidelines, or inter-annotator agreement. Without these, the central realism and generalizability assertions cannot be evaluated.

minor comments (1)

[Abstract] Abstract: The phrase 'speech utterances features' is vague; specifying which acoustic or prosodic features are provided would improve clarity for readers interested in ASR or SLU pipelines.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for their constructive review and for recognizing the potential utility of StarDrinks as a domain-specific resource for bilingual SLU evaluation. We address the major comment below and will incorporate the requested documentation into the revised manuscript.

read point-by-point responses

Referee: [Abstract] Abstract: The claim that StarDrinks 'provides a realistic benchmark for model robustness and generalization' is load-bearing for the paper's contribution, yet the manuscript supplies no information on collection methodology (e.g., prompted vs. natural speech), speaker count or demographics, disfluency rates, entity coverage statistics, annotation guidelines, or inter-annotator agreement. Without these, the central realism and generalizability assertions cannot be evaluated.

Authors: We agree that the manuscript would be strengthened by explicit details on dataset construction to substantiate the realism claims. The submitted version does not provide these specifics. In the revision we will add a dedicated 'Dataset Construction' section describing the collection methodology, speaker count and demographics, disfluency rates, entity coverage statistics, annotation guidelines, and inter-annotator agreement. This will allow readers to evaluate the benchmark's properties directly. revision: yes

Circularity Check

0 steps flagged

No circularity in dataset introduction paper

full rationale

The paper is a direct resource contribution that introduces the StarDrinks test set with speech utterances, transcriptions, and slot annotations for English and Korean drink-ordering scenarios. It contains no derivations, equations, fitted parameters, predictions, or self-citations that function as load-bearing premises. The central claim—that the dataset provides a realistic benchmark for SLU/NLU/ASR evaluation—is presented as following immediately from the described collection and annotation process, without any reduction to self-definition, renamed empirical patterns, or imported uniqueness results. No step in the manuscript reduces by construction to its own inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is a dataset release paper with no mathematical model, free parameters, axioms, or invented entities; the contribution rests entirely on the creation and annotation of the resource itself.

pith-pipeline@v0.9.0 · 5455 in / 1084 out tokens · 50495 ms · 2026-05-07T11:22:14.495413+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

20 extracted references · 20 canonical work pages · 1 internal anchor

[1]

Rubino, Nicolas Guenon des mesnards, Uday Shah, Nanjiang Jiang, Weiqi Sun, and Konstantine Arkoudas

Melanie A. Rubino, Nicolas Guenon des mesnards, Uday Shah, Nanjiang Jiang, Weiqi Sun, and Konstantine Arkoudas. Cross-TOP: Zero-shot cross-schema task- oriented parsing. InProceedings of the Third Workshop on Deep Learning for Low-Resource Natural Language Processing, pages 48–60, Hybrid, 2022. Association for Computational Linguistics. 2

work page 2022
[2]

GPT-4 Technical Report

Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report.arXiv preprint arXiv:2303.08774, 2023. 4

work page internal anchor Pith review arXiv 2023
[3]

Slurp: A spoken language understanding resource package

Emanuele Bastianelli, Andrea Vanzo, Pawel Swietojan- ski, Verena Rieser, and Oliver Lemon. Slurp: A spoken language understanding resource package. InProceed- ings of the 2020 Conference on Empirical Methods in Nat- ural Language Processing (EMNLP), pages 7252–7262,

work page 2020
[4]

Low-resource domain adap- tation for compositional task-oriented semantic pars- ing

Xilun Chen, Asish Ghoshal, Yashar Mehdad, Luke Zettle- moyer, and Sonal Gupta. Low-resource domain adap- tation for compositional task-oriented semantic pars- ing. InProceedings of the 2020 Conference on Empiri- cal Methods in Natural Language Processing (EMNLP), pages 5090–5100, Online, 2020. Association for Com- putational Linguistics. 2

work page 2020
[5]

Snips voice platform: An embedded spoken language understanding system for private-by-design voice interfaces

AliceCoucke,AlaaSaade,AdrienBall,ThéodoreBluche, Alexandre Caulier, David Leroy, Clément Doumouro, Thibault Gisselbrecht, Francesco Caltagirone, Thibaut Lavril, et al. Snips voice platform: An embedded spoken language understanding system for private-by-design voice interfaces. InProceedings of the Annual Conference of the International Speech Communicati...

work page 2018
[6]

MASSIVE: A 1M- example multilingual natural language understanding dataset with 51 typologically-diverse languages

JackFitzGerald, ChristopherHench, CharithPeris, Scott Mackie, Kay Rottmann, Ana Sanchez, Aaron Nash, Liam Urbach, Vishesh Kakarala, Richa Singh, Swetha Ran- ganath, Laurie Crist, Misha Britan, Wouter Leeuwis, Gokhan Tur, and Prem Natarajan. MASSIVE: A 1M- example multilingual natural language understanding dataset with 51 typologically-diverse languages. ...

work page 2023
[7]

Speech intent recognitionforsmartassistants: Areview

Praveen Gupta, Anurag Gupta, et al. Speech intent recognitionforsmartassistants: Areview. InProceedings of the IEEE International Conference on Smart Systems andInventiveTechnology(ICSSIT),pages489–495,2020. 2

work page 2020
[8]

The atis spoken language systems pilot corpus

Charles T Hemphill, John J Godfrey, and George R Dod- dington. The atis spoken language systems pilot corpus. InSpeech and Natural Language: Proceedings of a Work- shop, pages 96–101. ACL, 1990. 2

work page 1990
[9]

Speech-MASSIVE: A Multilingual Speech Dataset for SLU and Beyond

Beomseok Lee, Ioan Calapodescu, Marco Gaido, Mat- 7 StarDrinks teo Negri, and Laurent Besacier. Speech-MASSIVE: A Multilingual Speech Dataset for SLU and Beyond. In Interspeech 2024, pages 817–821, 2024. 2

work page 2024
[10]

Con- tinual test-time adaptation for end-to-end speech recog- nitiononnoisyspeech

Guan-Ting Lin, Wei Ping Huang, and Hung-yi Lee. Con- tinual test-time adaptation for end-to-end speech recog- nitiononnoisyspeech. InProceedingsofthe2024Confer- ence on Empirical Methods in Natural Language Process- ing, pages 20003–20015, Miami, Florida, USA, 2024. Association for Computational Linguistics. 5

work page 2024
[11]

Speech model pre-training for end-to-end spoken language understanding

Loren Lugosch, Mirco Ravanelli, Patrick Ignoto, Vikrant Tomar, and Yoshua Bengio. Speech model pre-training for end-to-end spoken language understanding. InIn- terspeech, pages 814–818, 2019. 2

work page 2019
[12]

On robustness and reliability of benchmark-based evaluation of llms

RiccardoLunardi,VincenzoDellaMea,StefanoMizzaro, and Kevin Roitero. On robustness and reliability of benchmark-based evaluation of llms.arXiv preprint arXiv:2509.04013, 2025. 1

work page arXiv 2025
[13]

Speech-enriched memory for inference-time adaptation of ASR models to word dictionaries

Ashish Mittal, Sunita Sarawagi, Preethi Jyothi, George Saon, and Gakuto Kurata. Speech-enriched memory for inference-time adaptation of ASR models to word dictionaries. InProceedings of the 2023 Conference on EmpiricalMethodsinNaturalLanguageProcessing, pages 14820–14835, Singapore, 2023. Association for Com- putational Linguistics. 5

work page 2023
[14]

Scaling speech technology to 1,000+ languages.Jour- nal of Machine Learning Research, 25(97):1–52, 2024

Vineel Pratap, Andros Tjandra, Bowen Shi, Paden Tomasello, Arun Babu, Sayani Kundu, Ali Elkahky, Zhao- heng Ni, Apoorv Vyas, Maryam Fazel-Zarandi, et al. Scaling speech technology to 1,000+ languages.Jour- nal of Machine Learning Research, 25(97):1–52, 2024. 4

work page 2024
[15]

Robust speech recognition via large-scale weak supervision

Alec Radford, Jong Wook Kim, Tao Xu, Greg Brock- man, Christine McLeavey, and Ilya Sutskever. Robust speech recognition via large-scale weak supervision. InInternational conference on machine learning, pages 28492–28518. PMLR, 2023. 4

work page 2023
[16]

Cross-TOP: Zero-Shot Cross-Schema Task- Oriented Parsing

Melanie Rubino, Nicolas Guenon des Mesnards, Uday Shah, Nanjiang Jiang, Weiqi Sun, and Konstantine Arkoudas. Cross-TOP: Zero-Shot Cross-Schema Task- Oriented Parsing. InProceedings of the Third Workshop on Deep Learning for Low-Resource Natural Language Processing, pages 48–60, Hybrid, 2022. Association for Computational Linguistics. 6

work page 2022
[17]

Stop: A dataset for spoken task oriented semantic parsing

Paden Tomasello, Akshat Shrivastava, Daniel Lazar, Po- Chun Hsu, Duc Le, Adithya Sagar, Ali Elkahky, Jade Copet, Wei-Ning Hsu, Yossi Adi, Robin Algayres, Tu Ahn Nguyen, Emmanuel Dupoux, Luke Zettlemoyer, and Abdelrahman Mohamed. Stop: A dataset for spoken task oriented semantic parsing. In2022 IEEE Spoken Language Technology Workshop (SLT), pages 991–998,

work page
[18]

Rethinking gen- erative large language model evaluation for semantic comprehension

Fangyun Wei, Xi Chen, and Lin Luo. Rethinking gen- erative large language model evaluation for semantic comprehension. InProceedings of the 41st International Conference on Machine Learning, pages 52525–52558. PMLR, 2024. 1

work page 2024
[19]

Probe-rewrite-evaluate: A workflow for reliable benchmarks and quantifying evaluation awareness

Lang Xiong, Nishant Bhargava, Wesley Chang, Jian- hang Hong, Haihao Liu, and Kevin Zhu. Stealtheval: A probe-rewrite-evaluate workflow for reliable bench- marks.arXiv preprint arXiv:2509.00591, 2025. 1

work page arXiv 2025
[20]

Eunseop Yoon, Hee Suk Yoon, John Harvill, Mark Hasegawa-Johnson, and Chang D. Yoo. LI-TTA: Lan- guage Informed Test-Time Adaptation for Automatic Speech Recognition. InInterspeech 2024, pages 3490– 3494, 2024. 5 8

work page 2024

[1] [1]

Rubino, Nicolas Guenon des mesnards, Uday Shah, Nanjiang Jiang, Weiqi Sun, and Konstantine Arkoudas

Melanie A. Rubino, Nicolas Guenon des mesnards, Uday Shah, Nanjiang Jiang, Weiqi Sun, and Konstantine Arkoudas. Cross-TOP: Zero-shot cross-schema task- oriented parsing. InProceedings of the Third Workshop on Deep Learning for Low-Resource Natural Language Processing, pages 48–60, Hybrid, 2022. Association for Computational Linguistics. 2

work page 2022

[2] [2]

GPT-4 Technical Report

Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report.arXiv preprint arXiv:2303.08774, 2023. 4

work page internal anchor Pith review arXiv 2023

[3] [3]

Slurp: A spoken language understanding resource package

Emanuele Bastianelli, Andrea Vanzo, Pawel Swietojan- ski, Verena Rieser, and Oliver Lemon. Slurp: A spoken language understanding resource package. InProceed- ings of the 2020 Conference on Empirical Methods in Nat- ural Language Processing (EMNLP), pages 7252–7262,

work page 2020

[4] [4]

Low-resource domain adap- tation for compositional task-oriented semantic pars- ing

Xilun Chen, Asish Ghoshal, Yashar Mehdad, Luke Zettle- moyer, and Sonal Gupta. Low-resource domain adap- tation for compositional task-oriented semantic pars- ing. InProceedings of the 2020 Conference on Empiri- cal Methods in Natural Language Processing (EMNLP), pages 5090–5100, Online, 2020. Association for Com- putational Linguistics. 2

work page 2020

[5] [5]

Snips voice platform: An embedded spoken language understanding system for private-by-design voice interfaces

AliceCoucke,AlaaSaade,AdrienBall,ThéodoreBluche, Alexandre Caulier, David Leroy, Clément Doumouro, Thibault Gisselbrecht, Francesco Caltagirone, Thibaut Lavril, et al. Snips voice platform: An embedded spoken language understanding system for private-by-design voice interfaces. InProceedings of the Annual Conference of the International Speech Communicati...

work page 2018

[6] [6]

MASSIVE: A 1M- example multilingual natural language understanding dataset with 51 typologically-diverse languages

JackFitzGerald, ChristopherHench, CharithPeris, Scott Mackie, Kay Rottmann, Ana Sanchez, Aaron Nash, Liam Urbach, Vishesh Kakarala, Richa Singh, Swetha Ran- ganath, Laurie Crist, Misha Britan, Wouter Leeuwis, Gokhan Tur, and Prem Natarajan. MASSIVE: A 1M- example multilingual natural language understanding dataset with 51 typologically-diverse languages. ...

work page 2023

[7] [7]

Speech intent recognitionforsmartassistants: Areview

Praveen Gupta, Anurag Gupta, et al. Speech intent recognitionforsmartassistants: Areview. InProceedings of the IEEE International Conference on Smart Systems andInventiveTechnology(ICSSIT),pages489–495,2020. 2

work page 2020

[8] [8]

The atis spoken language systems pilot corpus

Charles T Hemphill, John J Godfrey, and George R Dod- dington. The atis spoken language systems pilot corpus. InSpeech and Natural Language: Proceedings of a Work- shop, pages 96–101. ACL, 1990. 2

work page 1990

[9] [9]

Speech-MASSIVE: A Multilingual Speech Dataset for SLU and Beyond

Beomseok Lee, Ioan Calapodescu, Marco Gaido, Mat- 7 StarDrinks teo Negri, and Laurent Besacier. Speech-MASSIVE: A Multilingual Speech Dataset for SLU and Beyond. In Interspeech 2024, pages 817–821, 2024. 2

work page 2024

[10] [10]

Con- tinual test-time adaptation for end-to-end speech recog- nitiononnoisyspeech

Guan-Ting Lin, Wei Ping Huang, and Hung-yi Lee. Con- tinual test-time adaptation for end-to-end speech recog- nitiononnoisyspeech. InProceedingsofthe2024Confer- ence on Empirical Methods in Natural Language Process- ing, pages 20003–20015, Miami, Florida, USA, 2024. Association for Computational Linguistics. 5

work page 2024

[11] [11]

Speech model pre-training for end-to-end spoken language understanding

Loren Lugosch, Mirco Ravanelli, Patrick Ignoto, Vikrant Tomar, and Yoshua Bengio. Speech model pre-training for end-to-end spoken language understanding. InIn- terspeech, pages 814–818, 2019. 2

work page 2019

[12] [12]

On robustness and reliability of benchmark-based evaluation of llms

RiccardoLunardi,VincenzoDellaMea,StefanoMizzaro, and Kevin Roitero. On robustness and reliability of benchmark-based evaluation of llms.arXiv preprint arXiv:2509.04013, 2025. 1

work page arXiv 2025

[13] [13]

Speech-enriched memory for inference-time adaptation of ASR models to word dictionaries

Ashish Mittal, Sunita Sarawagi, Preethi Jyothi, George Saon, and Gakuto Kurata. Speech-enriched memory for inference-time adaptation of ASR models to word dictionaries. InProceedings of the 2023 Conference on EmpiricalMethodsinNaturalLanguageProcessing, pages 14820–14835, Singapore, 2023. Association for Com- putational Linguistics. 5

work page 2023

[14] [14]

Scaling speech technology to 1,000+ languages.Jour- nal of Machine Learning Research, 25(97):1–52, 2024

Vineel Pratap, Andros Tjandra, Bowen Shi, Paden Tomasello, Arun Babu, Sayani Kundu, Ali Elkahky, Zhao- heng Ni, Apoorv Vyas, Maryam Fazel-Zarandi, et al. Scaling speech technology to 1,000+ languages.Jour- nal of Machine Learning Research, 25(97):1–52, 2024. 4

work page 2024

[15] [15]

Robust speech recognition via large-scale weak supervision

Alec Radford, Jong Wook Kim, Tao Xu, Greg Brock- man, Christine McLeavey, and Ilya Sutskever. Robust speech recognition via large-scale weak supervision. InInternational conference on machine learning, pages 28492–28518. PMLR, 2023. 4

work page 2023

[16] [16]

Cross-TOP: Zero-Shot Cross-Schema Task- Oriented Parsing

Melanie Rubino, Nicolas Guenon des Mesnards, Uday Shah, Nanjiang Jiang, Weiqi Sun, and Konstantine Arkoudas. Cross-TOP: Zero-Shot Cross-Schema Task- Oriented Parsing. InProceedings of the Third Workshop on Deep Learning for Low-Resource Natural Language Processing, pages 48–60, Hybrid, 2022. Association for Computational Linguistics. 6

work page 2022

[17] [17]

Stop: A dataset for spoken task oriented semantic parsing

Paden Tomasello, Akshat Shrivastava, Daniel Lazar, Po- Chun Hsu, Duc Le, Adithya Sagar, Ali Elkahky, Jade Copet, Wei-Ning Hsu, Yossi Adi, Robin Algayres, Tu Ahn Nguyen, Emmanuel Dupoux, Luke Zettlemoyer, and Abdelrahman Mohamed. Stop: A dataset for spoken task oriented semantic parsing. In2022 IEEE Spoken Language Technology Workshop (SLT), pages 991–998,

work page

[18] [18]

Rethinking gen- erative large language model evaluation for semantic comprehension

Fangyun Wei, Xi Chen, and Lin Luo. Rethinking gen- erative large language model evaluation for semantic comprehension. InProceedings of the 41st International Conference on Machine Learning, pages 52525–52558. PMLR, 2024. 1

work page 2024

[19] [19]

Probe-rewrite-evaluate: A workflow for reliable benchmarks and quantifying evaluation awareness

Lang Xiong, Nishant Bhargava, Wesley Chang, Jian- hang Hong, Haihao Liu, and Kevin Zhu. Stealtheval: A probe-rewrite-evaluate workflow for reliable bench- marks.arXiv preprint arXiv:2509.00591, 2025. 1

work page arXiv 2025

[20] [20]

Eunseop Yoon, Hee Suk Yoon, John Harvill, Mark Hasegawa-Johnson, and Chang D. Yoo. LI-TTA: Lan- guage Informed Test-Time Adaptation for Automatic Speech Recognition. InInterspeech 2024, pages 3490– 3494, 2024. 5 8

work page 2024