WASIL: In-the-Wild Arabic Spoken Interactions with LLMs

Firoj Alam; Hamdy Mubarak; Hunzalah Hassan Bhatti; Shammur Absar Chowdhury; Soon-Gyo Jung; Zien Sheikh Ali

arXiv: 2605.16364 · v1 · pith:AIL4AKRXnew · submitted 2026-05-09 · 💻 cs.SD · cs.AI · cs.CL


Pith reviewed 2026-05-20 23:11 UTC · model grok-4.3

classification 💻 cs.SD · cs.AI · cs.CL

keywords: Arabic spoken interactions · LLM voice assistants · ASR errors · dataset release · answerability annotation · reference-free evaluation · dialectal Arabic


The pith

The WASIL dataset captures real Arabic spoken interactions with LLMs to isolate speech recognition errors from other causes of user dissatisfaction.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper releases a dataset of actual user voice interactions with Arabic LLM assistants that includes the audio, ASR output, the assistant's response, and the user's like or dislike rating. The collection has 8,529 turns and a 2,000-turn test set that covers standard Arabic as well as four major dialects. By using agreement among several ASR systems to create low-cost accurate transcripts and by labeling whether each turn is answerable or not, the authors separate cases where the system simply misheard the user from cases where the query was ambiguous, unsupported, or not a request at all. This separation matters because it lets developers see how much of the dissatisfaction comes from the speech recognition step rather than from the language model itself. The paper also describes a way to evaluate the quality of responses using multiple LLM judges without needing a gold standard answer.

Core claim

WASIL provides 8,529 turns of in-the-wild Arabic spoken LLM interactions with audio, ASR hypotheses, responses, and explicit feedback, plus a 2,000-turn test set labeled for MSA and four dialects. Low-cost gold transcripts are created through multi-ASR agreement-guided post-editing, and answerability is annotated to distinguish intrinsic unanswerability from ASR-induced degradation. Scalable reference-free evaluation is outlined using multi-judge LLM scoring for responses based on ASR versus gold transcripts.

What carries the argument

The WASIL dataset, multi-ASR agreement-guided post-editing for gold transcripts, and answerability annotations that isolate ASR effects.
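The agreement signal behind the post-editing step can be illustrated with a small sketch. The figure captions on this page mention cosine similarity between Fanar and Gemini transcriptions but do not say how that similarity is computed, so everything below is an assumption for illustration: the character-trigram representation, the function names `char_ngrams`, `cosine_similarity`, and `needs_post_editing`, and the 0.9 threshold are all hypothetical and are not taken from the paper.

```python
from collections import Counter
from math import sqrt


def char_ngrams(text: str, n: int = 3) -> Counter:
    """Count overlapping character n-grams after collapsing whitespace."""
    normalized = " ".join(text.split())
    if not normalized:
        return Counter()
    if len(normalized) < n:
        return Counter([normalized])
    return Counter(normalized[i:i + n] for i in range(len(normalized) - n + 1))


def cosine_similarity(hyp_a: str, hyp_b: str, n: int = 3) -> float:
    """Cosine similarity between the n-gram count vectors of two hypotheses."""
    vec_a, vec_b = char_ngrams(hyp_a, n), char_ngrams(hyp_b, n)
    if not vec_a or not vec_b:
        return 0.0
    dot = sum(count * vec_b[gram] for gram, count in vec_a.items())
    norm_a = sqrt(sum(c * c for c in vec_a.values()))
    norm_b = sqrt(sum(c * c for c in vec_b.values()))
    return dot / (norm_a * norm_b)


def needs_post_editing(hyp_a: str, hyp_b: str, threshold: float = 0.9) -> bool:
    """Flag a turn for human post-editing when two ASR systems disagree.

    The threshold is a hypothetical value, not one reported by the authors.
    """
    return cosine_similarity(hyp_a, hyp_b) < threshold


# Two hypothetical ASR outputs for the same audio turn.
fanar_hyp = "ما هي عاصمة مصر"
gemini_hyp = "ما هو عاصمه مصر"
print(needs_post_editing(fanar_hyp, gemini_hyp))
```

Under this kind of triage, turns where the two systems agree closely would need little or no editing, which is one plausible reading of how agreement keeps transcription cost low.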

If this is right

The feedback and labels allow direct measurement of ASR error impact on user satisfaction in Arabic voice assistants.
The dialect-labeled test set supports evaluation across different Arabic varieties.
Multi-judge LLM scoring provides a scalable way to compare ASR and gold transcript performance.
Answerability categories help exclude non-request turns from quality assessments.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

This approach to low-cost transcription could support dataset creation for other languages with limited resources.
Developers might use the evaluation method to rapidly iterate on ASR improvements for voice LLMs.
The dataset could reveal patterns in how dialects affect interaction success rates.

Load-bearing premise

The multi-ASR agreement post-editing process produces transcripts that are accurate enough to distinguish ASR mistakes from intrinsic query problems.

What would settle it

Independent human transcription of a portion of the data shows significant errors in the gold transcripts, or the LLM judge ratings fail to correlate with human ratings of the responses.

Figures

Figures reproduced from arXiv: 2605.16364 by Firoj Alam, Hamdy Mubarak, Hunzalah Hassan Bhatti, Shammur Absar Chowdhury, Soon-Gyo Jung, Zien Sheikh Ali.

**Figure 1.** WASIL dataset development process, from multinational spoken prompts collection and cascaded model inference to multi-layer human annotation. Participants were recruited from four Arab countries: Algeria, Egypt, Sudan, and Syria. Data were collected over nine days using daily topic prompts that encouraged diverse interactions, including open discussion (any topic), follow-up questions, creative writing…

**Figure 2.** The distribution of cosine similarity scores between Fanar and Gemini transcriptions on the whole dataset (9,304). To further validate this hypothesis, we conduct an additional analysis using the gold transcriptions. We compute word-level (WER) and character-level (CER) edit distances between the gold transcriptions and the Fanar ASR outputs to quantify the actual post-editing effort required. Figures 3 a…

**Figure 3.** Cosine similarity between Fanar and Gemini transcriptions for post-edited utterances, split into higher CER cases that required more edits (left) and lower CER cases that required fewer edits (right). 3.6.6. ASR Errors vs. Response Dislikeness: We investigate the potential reasons behind dislike reactions by examining the relationship between ASR similarity scores and user feedback…

**Figure 4.** Cosine similarity between Fanar and Gemini transcriptions for post-edited utterances, split into higher WER cases that required more edits (left) and lower WER cases that required fewer edits (right). [Plot residue: histograms with x-axis "Cosine Similarity" (0.0 to 1.0) and y-axis "Frequency"; panel "Prompts With Like Response (7,482)" with bin shares 0%, 0%, 0%, 0%, 0%, 1%, 2%, 5%, 15%, 72%; second panel truncated…]

**Figure 5.** Distribution of cosine similarity scores between Fanar and Gemini transcriptions for spoken prompts categorized by like (left) and dislike (right) reactions. …we provide the gold transcription to Fanar, which results in the correct answer. We also present an example of a prompt without transcription errors that still received a dislike reaction (Figure 6c). In this case, the prompt is incomplete and lacks …

**Figure 7.** Model performance metrics. (a) Direct audio input performance across models. (b) The impact of transcript quality on Average Pass Rate (APR) for cascaded models. …compute the fraction of satisfied criteria: $s_i = \frac{1}{N_i} \sum_{j=1}^{N_i} \mathbb{1}[r_{ij}]$ (3). Equivalently, the overall ARS is the average of these per-query pass rates across the entire test set: $\mathrm{ARS} = \frac{1}{N} \sum_{i=1}^{N} s_i$ (4).
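The pass-rate definition that accompanies Figure 7 (a per-query fraction of satisfied criteria, averaged over the test set) can be sketched in a few lines. This is a minimal illustration under stated assumptions: each query is represented as a list of booleans, one per rubric criterion, and any combination of several LLM judges into a single pass/fail per criterion is assumed to have happened upstream, since this page does not describe the aggregation rule. The names `pass_rate` and `average_pass_rate` are hypothetical.

```python
from statistics import mean


def pass_rate(criteria_passed: list[bool]) -> float:
    """Fraction of rubric criteria satisfied for one query (s_i)."""
    if not criteria_passed:
        raise ValueError("each query needs at least one rubric criterion")
    return sum(criteria_passed) / len(criteria_passed)


def average_pass_rate(per_query: list[list[bool]]) -> float:
    """Mean of the per-query pass rates over the whole test set."""
    return mean(pass_rate(query) for query in per_query)


# Hypothetical judgments for three queries with 4, 2, and 1 criteria.
judgments = [
    [True, True, False, True],
    [False, False],
    [True],
]
per_query_scores = [pass_rate(query) for query in judgments]
overall = average_pass_rate(judgments)
print(per_query_scores, round(overall, 4))
```

Because each query is first reduced to its own pass rate, queries with many criteria do not outweigh queries with few, which matches averaging the per-query rates rather than pooling all criteria.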

Original abstract

Large Language Model (LLM) voice assistants are commonly built as cascaded Automatic Speech Recognition (ASR) to LLM systems, where recognition errors can distort user intent. Dislikes may also arise from ambiguous, out-of-domain, or non-request turns, making it hard to isolate ASR effects. We release WASIL (it denotes connection or linking in Arabic): in-the-wild Arabic spoken interaction prompts with audio, ASR hypotheses, assistant responses, and explicit like/dislike feedback (8,529 turns; 14.2% dislikes), plus a 2,000-turn test set covering Modern Standard Arabic (MSA) and four major dialects with their labels. We provide low-cost gold transcripts via multi-ASR agreement-guided post-editing and annotate answerability (answerable, ambiguous/needs-clarification, unsupported, not-a-request/noise) to separate intrinsic unanswerability from ASR-induced degradation. Finally, we describe scalable reference-free evaluation of responses from ASR vs. gold transcripts using multi-judge LLM scoring.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

WASIL offers a practical dataset for Arabic spoken LLM interactions but lacks validation of its transcript quality.


The main point is that this paper releases a dataset called WASIL for real-world Arabic spoken chats with LLMs, including user feedback and labels to help sort out speech recognition problems from other issues. They collected 8,529 turns with audio, ASR hypotheses, the responses given, and like or dislike ratings. A separate test set of 2,000 turns includes Modern Standard Arabic and four dialects, along with answerability tags like answerable, ambiguous, unsupported, or just noise. The idea is to use several ASR systems to create better transcripts through agreement-based editing, then use those to annotate what is truly unanswerable versus what got messed up by bad recognition. They also sketch a way to score the LLM responses without references by having multiple LLM judges compare outputs from ASR input and from the edited transcripts.

This combination is new for Arabic. Most prior work either sticks to English or lacks the feedback and diagnostic labels. The dialect coverage and the focus on in-the-wild data make it relevant for building voice systems that actually work for Arabic users. Releasing the audio lets others test their own ASR setups on the same material.

The soft spot is the missing proof that the post-edited transcripts are good enough. The abstract talks up the multi-ASR method for low-cost gold transcripts, but there are no numbers on accuracy, no word error rates, and no human checks on a sample of the data. If those transcripts still have mistakes, the whole separation of ASR effects from intrinsic problems becomes shaky, and the evaluation results lose reliability. The multi-judge scoring protocol could use more specifics too, like how scores are combined and whether it matches human opinions.

This paper is for researchers in spoken language systems and multilingual AI who need data to study cascaded ASR-LLM setups in Arabic. A reader working on evaluation methods or dataset creation would find the resource and the pipeline description useful. The thinking is clear and the approach addresses a real gap without overclaiming. It should go to peer review so the details can be checked and strengthened.

Referee Report

1 major / 2 minor

Summary. The manuscript introduces WASIL, a dataset of 8,529 in-the-wild Arabic spoken interaction turns including audio, ASR hypotheses, LLM assistant responses, and explicit like/dislike feedback, plus a 2,000-turn test set covering MSA and four major dialects with labels. It describes low-cost gold transcript creation via multi-ASR agreement-guided post-editing, answerability annotation (answerable, ambiguous/needs-clarification, unsupported, not-a-request/noise) to separate intrinsic unanswerability from ASR-induced degradation, and a scalable reference-free evaluation using multi-judge LLM scoring of responses from ASR versus gold transcripts.

Significance. If the post-edited transcripts prove reliable, the dataset would provide a useful resource for diagnosing sources of user dissatisfaction in cascaded ASR-LLM voice systems for Arabic, including dialectal varieties. The release of raw audio, feedback labels, and the proposed reference-free evaluation protocol is a concrete contribution that could support follow-on work on ASR robustness. Credit is given for the dataset release and the practical focus on low-cost transcript generation and scalable evaluation.

major comments (1)

[Abstract and transcript creation description] The multi-ASR agreement-guided post-editing procedure is presented as producing reliable gold transcripts for answerability annotation and reference-free evaluation, yet no WER, edit-distance statistics, or human validation results are reported on any subset of the 8,529 turns or the 2,000-turn test set. This directly affects the central utility claim of separating ASR-induced degradation from intrinsic unanswerability.

minor comments (2)

[Data collection and annotation] Additional details on data collection biases, inter-annotator agreement for answerability labels, and the precise multi-judge LLM scoring protocol (number of judges, aggregation rule) would improve reproducibility.
[Evaluation] Clarify the exact composition and usage of the 2,000-turn test set within the evaluation experiments.
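One way the requested inter-annotator agreement for answerability labels could be reported is Cohen's kappa between two annotators. The sketch below is only an illustration: this page does not say which agreement statistic the authors use, the function name `cohens_kappa` is hypothetical, and the short label strings are stand-ins for the four answerability categories named in the abstract.

```python
from collections import Counter


def cohens_kappa(labels_a: list[str], labels_b: list[str]) -> float:
    """Chance-corrected agreement between two annotators on the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty item set")
    n = len(labels_a)
    # Observed agreement: share of items given identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement if each annotator labeled independently
    # according to their own label frequencies.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(counts_a[k] * counts_b[k] for k in counts_a.keys() | counts_b.keys()) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1.0 - expected)


# Hypothetical labels from two annotators on four turns.
annotator_1 = ["answerable", "answerable", "ambiguous", "unsupported"]
annotator_2 = ["answerable", "ambiguous", "ambiguous", "unsupported"]
kappa = cohens_kappa(annotator_1, annotator_2)
print(round(kappa, 4))
```

With skewed label distributions, as is likely when most turns are answerable, kappa can look low despite high raw agreement, so reporting raw agreement alongside it would be a reasonable precaution.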

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback and for recognizing the potential utility of the WASIL dataset, the value of the raw audio and feedback release, and the practical emphasis on low-cost transcript generation and scalable evaluation. We address the single major comment below.


Referee: [Abstract and transcript creation description] The multi-ASR agreement-guided post-editing procedure is presented as producing reliable gold transcripts for answerability annotation and reference-free evaluation, yet no WER, edit-distance statistics, or human validation results are reported on any subset of the 8,529 turns or the 2,000-turn test set. This directly affects the central utility claim of separating ASR-induced degradation from intrinsic unanswerability.

Authors: We agree that the absence of quantitative validation metrics for the post-edited transcripts weakens the central claim that the procedure reliably separates ASR-induced degradation from intrinsic unanswerability. In the revised manuscript we will add (i) WER and character-level edit-distance statistics comparing the final post-edited transcripts against the original multi-ASR hypotheses on both the full 8,529-turn collection and the 2,000-turn test set, and (ii) human validation results (inter-annotator agreement and error analysis) on a stratified random subset of at least 500 turns. These additions will be placed in a new subsection under “Gold Transcript Creation” and will be referenced from the abstract.

Revision: yes
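The WER and character-level statistics discussed in this exchange are standard edit-distance ratios, and a minimal sketch shows what would be computed per turn. The code below is an assumption-laden illustration rather than the authors' tooling: it uses plain whitespace tokenization, counts spaces as characters for CER, applies no Arabic-specific normalization (such as diacritic or alef/hamza handling), and the function names are hypothetical.

```python
def edit_distance(ref: list[str], hyp: list[str]) -> int:
    """Levenshtein distance (substitutions, insertions, deletions) between sequences."""
    prev = list(range(len(hyp) + 1))
    for i, ref_tok in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_tok in enumerate(hyp, start=1):
            cost = 0 if ref_tok == hyp_tok else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance over reference word count."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: character-level edit distance over reference length."""
    if not reference:
        raise ValueError("reference must contain at least one character")
    return edit_distance(list(reference), list(hypothesis)) / len(reference)


# Hypothetical gold transcript and ASR hypothesis for one turn.
gold = "ما هي عاصمة مصر"
asr = "ما هو عاصمة مصر"
print(wer(gold, asr), cer(gold, asr))
```

Averaging these per-turn values, or pooling edits and reference lengths across turns, gives two slightly different corpus-level numbers, so a report would need to state which convention it uses.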

Circularity Check

0 steps flagged

No circularity: dataset release and procedural description only

full rationale

The paper releases the WASIL dataset of Arabic spoken interactions with audio, ASR hypotheses, responses, and feedback, then describes a multi-ASR agreement-guided post-editing process for low-cost gold transcripts plus reference-free multi-judge LLM scoring for evaluation. No mathematical derivations, predictions, fitted parameters, or closed-form results are claimed anywhere in the manuscript. All steps are empirical data collection and annotation procedures that stand independently without reducing to self-definitions, self-citations, or renamed inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The contributions rest on established practices in ASR post-editing and LLM-as-judge evaluation; no new free parameters or invented entities are introduced.

axioms (2)

domain assumption: Agreement among multiple ASR systems can guide post-editing to produce usable gold transcripts at low cost.
Invoked to create the low-cost gold transcripts described in the abstract.
domain assumption: Multi-judge LLM scoring yields reliable reference-free quality estimates for assistant responses.
Used to enable the scalable evaluation of ASR versus gold transcript responses.

pith-pipeline@v0.9.0 · 5732 in / 1350 out tokens · 71940 ms · 2026-05-20T23:11:49.745531+00:00 · methodology


Reference graph

Works this paper leans on

61 extracted references · 61 canonical work pages · 5 internal anchors

[1]

Introduction. Large language models (LLMs) are increasingly embedded in everyday applications, supporting both text and speech interaction and enabling open-domain conversational assistants beyond intent–slot pipelines [1, 2]. In many practical systems, speech interaction is implemented as a cascade in which automatic speech recognition (ASR) first c...

work page internal anchor Pith review Pith/arXiv arXiv 2026
[2]

downstream failures

Related Work. 2.1. Interaction Datasets. Large-scale logs of human–assistant interactions have enabled empirical analysis of failure modes and preference learning for text-based assistants. WildChat collects one million real ChatGPT interaction logs [12]. Chatbot Arena provides pairwise human preferences and an Elo-style ranking framework for LLM evaluati...

work page
[3]

religiously wrong

Datasets. 3.1. Data Collection. In Figure 1, we present WASIL dataset development process. For data collection, we recruited 93 users to interact with an Arabic-centric ASR → LLM system. For both tasks, we used the publicly available Fanar APIs [22]. The same user recordings were also processed with an alternative pipeline that uses Gemini [23] for both A...

work page 2000
[4]

Experimental Setup We benchmark both open and closed models under multiple query input variations, including (i) transcript using ASR vs

Experiments and Results. 4.1. Experimental Setup. We benchmark both open and closed models under multiple query input variations, including (i) transcript using ASR vs. gold transcripts, and (ii) raw audio. For ASR, as noted earlier, we use Fanar Aura and Gemini, since both have shown competitive performance for Arabic in prior work [49]. This setup allow...

work page
[5]

Effect of Input Modality and Transcript Quality Table 6 details Gemini’s performance across different input conditions and rubric dimensions

Discussion. 5.1. Effect of Input Modality and Transcript Quality. Table 6 details Gemini’s performance across different input conditions and rubric dimensions. We observe a consistent improvement in overall performance as input quality transitions from direct audio to ASR transcripts, and finally to gold transcripts. When reasoning directly from audio, ...

work page
[6]

Conclusion. In this paper, we introduced WASIL, to our knowledge the first in-the-wild dataset of Arabic spoken interactions with LLMs, designed to capture realistic conversational conditions under dialect variation and speech-driven input noise. The dataset includes post-edited transcriptions, user feedback (like and dislike, with fine-grained categ...

work page
[7]

Voiceassistant-eval: Benchmarking ai assistants across listening, speaking, and viewing,

K. Wang, H. Ren, Z. Lu, M. Zhan, and H. Li, “Voiceassistant-eval: Benchmarking ai assistants across listening, speaking, and viewing,” arXiv preprint arXiv:2509.22651, 2025

work page arXiv 2025
[8]

SOVA-Bench: Benchmarking the Speech Conversation Ability for LLM-based Voice Assistant,

Y. Hou, H. Liu, Y. Wang, Z. Cheng, R. Wu, Q. Gu, Y. Wang, and Y. Wang, “SOVA-Bench: Benchmarking the Speech Conversation Ability for LLM-based Voice Assistant,” in Interspeech 2025, 2025, pp. 5713–5717

work page 2025
[9]

The cascade equivalence hypothesis: When do speech llms behave like asr →llm pipelines?

J. Billa, “The cascade equivalence hypothesis: When do speech llms behave like asr →llm pipelines?” arXiv preprint arXiv:2602.17598, 2026

work page arXiv 2026
[10]

Back transcription as a method for evaluating robustness of natural language understanding models to speech recognition errors,

M. Kubis, P. Skórzewski, M. Sowański, and T. Zietkiewicz, “Back transcription as a method for evaluating robustness of natural language understanding models to speech recognition errors,” in Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing. Singapore: Association for Computational Linguistics, Dec. 2023, pp....

work page 2023
[11]

An analysis of dialogue repair in voice assistants,

M. Galbraith, “An analysis of dialogue repair in voice assistants,” arXiv preprint arXiv:2311.03952, 2024

work page arXiv 2024
[12]

Reject or not?: A benchmark for voice assistant query rejection in smart home scenario and an improved method based on llms,

H. Men, Y. Hu, Y. He, Y. Gao, X. Mou, and Y. Xu, “Reject or not?: A benchmark for voice assistant query rejection in smart home scenario and an improved method based on llms,” arXiv preprint arXiv:2512.10257, 2025

work page arXiv 2025
[13]

VoiceBench: Benchmarking llm-based voice assistants,

Y. Chen, X. Yue, C. Zhang, X. Gao, R. T. Tan, and H. Li, “VoiceBench: Benchmarking llm-based voice assistants,” Transactions of the Association for Computational Linguistics, vol. 14, pp. 378–398, 2026

work page 2026
[14]

Semantic Distance: A New Metric for ASR Performance Analysis Towards Spoken Language Understanding,

S. Kim, A. Arora, D. Le, C.-F. Yeh, C. Fuegen, O. Kalinli, and M. L. Seltzer, “Semantic Distance: A New Metric for ASR Performance Analysis Towards Spoken Language Understanding,” in Interspeech 2021, 2021, pp. 1977–1981

work page 2021
[15]

Significant ASR error detection for conversational voice assistants,

J. Harvill, R. Khaziev, S. Li, and R. Cogill, “Significant ASR error detection for conversational voice assistants,” in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2024

work page 2024
[16]

Evaluating Speech Recognition Performance Towards Large Language Model Based Voice Assistants,

Z. Liu, S. Kim, and O. Kalinli, “Evaluating Speech Recognition Performance Towards Large Language Model Based Voice Assistants,” in Interspeech 2024, 2024, pp. 4099–4103

work page 2024
[17]

Casablanca: Data and models for multidialectal Arabic speech recognition,

B. Talafha, K. Kadaoui, S. M. Magdy, M. Habiboullah, C. M. Chafei, A. O. El-Shangiti, H. Zayed, M. C. Tourad, R. Alhamouri, R. Assi, A. Alraeesi, H. Mohamed, F. Alwajih, A. Mohamed, A. El Mekki, E. M. B. Nagoudi, B. D. M. Saadia, H. A. Alsayadi, W. Al-Dhabyani, S. Shatnawi, Y. Ech-chammakhy, A. Makouar, Y. Berrachedi, M. Jarrar, S. Shehata, I. Berrada, ...

work page 2024
[18]

WildChat: 1m chatgpt interaction logs in the wild,

W. Zhao, X. Ren, J. Hessel, C. Cardie, Y. Choi, and Y. Deng, “WildChat: 1m chatgpt interaction logs in the wild,” in The Twelfth International Conference on Learning Representations, 2024

work page 2024
[19]

Chatbot arena: An open platform for evaluating llms by human preference,

W.-L. Chiang, L. Zheng, Y. Sheng, A. N. Angelopoulos, T. Li, D. Li, B. Zhu, H. Zhang, M. Jordan, J. E. Gonzalez et al., “Chatbot arena: An open platform for evaluating llms by human preference,” in International Conference on Machine Learning. PMLR, 2024, pp. 8359–8388

work page 2024
[20]

Judging LLM-as-a-judge with MT-bench and chatbot arena,

L. Zheng, W.-L. Chiang, Y. Sheng, S. Zhuang, Z. Wu, Y. Zhuang, Z. Lin, Z. Li, D. Li, E. P. Xing, H. Zhang, J. E. Gonzalez, and I. Stoica, “Judging LLM-as-a-judge with MT-bench and chatbot arena,” in Proceedings of the 37th International Conference on Neural Information Processing Systems, 2023, pp. 46595–46623

work page 2023
[21]

Sorry, i didn’t catch that! – an investigation of non-understanding errors and recovery strategies,

D. Bohus and A. I. Rudnicky, “Sorry, i didn’t catch that! – an investigation of non-understanding errors and recovery strategies,” in Proceedings of SIGDIAL 2005, 2005

work page 2005
[22]

Detecting out-of-domain utterances addressed to a virtual personal assistant,

G. Tur, A. Deoras, and D. Hakkani-Tur, “Detecting out-of-domain utterances addressed to a virtual personal assistant,” in Proceedings of Interspeech 2014, 2014

work page 2014
[23]

A survey on asking clarification questions datasets in conversational systems,

H. A. Rahmani, X. Wang, Y. Feng, Q. Zhang, E. Yilmaz, and A. Lipani, “A survey on asking clarification questions datasets in conversational systems,” in Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), A. Rogers, J. Boyd-Graber, and N. Okazaki, Eds. Toronto, Canada: Association for Computat...

work page 2023
[24]

A post-processing system to yield reduced word error rates: Recognizer output voting error reduction (ROVER),

J. G. Fiscus, “A post-processing system to yield reduced word error rates: Recognizer output voting error reduction (ROVER),” in Proceedings of the 1997 IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU), 1997

work page 1997
[25]

Multi-reference evaluation for dialectal speech recognition system,

A. Ali, P. Bell, and S. Renals, “Multi-reference evaluation for dialectal speech recognition system,” in Proceedings of the 4th Workshop on Arabic Natural Language Processing, 2015

work page 2015
[26]

Best practices for crowdsourcing dialectal Arabic speech transcription,

S. Wray, H. Mubarak, and A. Ali, “Best practices for crowdsourcing dialectal Arabic speech transcription,” in Proceedings of the 4th Workshop on Arabic Natural Language Processing, 2015

work page 2015
[27]

Better pseudo-labeling with multi-asr fusion and error correction by speechllm,

J. Prakash, B. Kumar, K. Hacioglu, B. Sharma, S. Gopalan, M. Chetlur, S. Venkatesan, and A. Stolcke, “Better pseudo-labeling with multi-asr fusion and error correction by speechllm,” in Interspeech 2025, 2025

work page 2025
[28]

Fanar: An arabic-centric multimodal generative ai platform,

F. Team, U. Abbas, M. S. Ahmad, F. Alam, E. Altinisik, E. Asgari, Y. Boshmaf, S. Boughorbel, S. Chawla, S. Chowdhury et al., “Fanar: An arabic-centric multimodal generative ai platform,” arXiv:2501.13944, 2025

work page arXiv 2025
[29]

Gemini: A Family of Highly Capable Multimodal Models

G. Team, R. Anil, S. Borgeaud, J.-B. Alayrac, J. Yu, R. Soricut, J. Schalkwyk, A. M. Dai, A. Hauth, K. Millican et al., “Gemini: a family of highly capable multimodal models,” arXiv preprint arXiv:2312.11805, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[30]

ALLam: Large language models for arabic and english,

M. S. Bari, Y. Alnumay, N. A. Alzahrani, N. M. Alotaibi, H. A. Alyahya, S. AlRashed, F. A. Mirza, S. Z. Alsubaie, H. A. Alahmed, G. Alabduljabbar, R. Alkhathran, Y. Almushayqih, R. Alnajim, S. Alsubaihi, M. A. Mansour, S. A. Hassan, D. M. Alrubaian, A. Alammari, Z. Alawami, A. Al-Thubaity, A. Abdelali, J. Kuriakose, A. Abujabal, N. Al-Twairesh, A. Alo...

work page 2025
[31]

Holistic evaluation of language models,

P. Liang, R. Bommasani, T. Lee, D. Tsipras, D. Soylu, M. Yasunaga, Y. Zhang, D. Narayanan, Y. Wu, A. Kumar, B. Newman, B. Yuan, B. Yan, C. Zhang, C. Cosgrove, C. D. Manning, C. Re, D. Acosta-Navas, D. A. Hudson, E. Zelikman et al., “Holistic evaluation of language models,” Transactions on Machine Learning Research, Aug. 2023, accepted by TMLR (OpenReview)

work page 2023
[32]

Training language models to follow instructions with human feedback,

L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. L. Wainwright, P. Mishkin, C. Zhang, S. Agarwal, K. Slama, A. Ray, J. Schulman, J. Hilton, F. Kelton, L. Miller, M. Simens, A. Askell, P. Welinder, P. F. Christiano, J. Leike, and R. Lowe, “Training language models to follow instructions with human feedback,” in Advances in Neural Information Processing Systems...

work page 2022
[33]

Instruction-Following Evaluation for Large Language Models

J. Zhou, T. Lu, S. Mishra, S. Brahma, S. Basu, Y. Luan, D. Zhou, and L. Hou, “Instruction-following evaluation for large language models,” arXiv preprint arXiv:2311.07911, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[34]

TruthfulQA: Measuring how models mimic human falsehoods,

S. Lin, J. Hilton, and O. Evans, “TruthfulQA: Measuring how models mimic human falsehoods,” in Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Dublin, Ireland: Association for Computational Linguistics, May 2022, pp. 3214–3252

work page 2022
[35]

Constitutional AI: Harmlessness from AI Feedback

Y. Bai, S. Kadavath, S. Kundu, A. Askell, J. Kernion, A. Jones, A. Chen, A. Goldie, A. Mirhoseini, C. McKinnon, C. Chen, C. Olsson, C. Olah, D. Hernandez, D. Drain, D. Ganguli, D. Li, E. Tran-Johnson, E. Perez et al., “Constitutional AI: Harmlessness from AI feedback,” arXiv preprint arXiv:2212.08073, Dec. 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[36]

OR-bench: An over-refusal benchmark for large language models,

J. Cui, W.-L. Chiang, I. Stoica, and C.-J. Hsieh, “OR-bench: An over-refusal benchmark for large language models,” in Proceedings of the 42nd International Conference on Machine Learning, ser. Proceedings of Machine Learning Research, A. Singh, M. Fazel, D. Hsu, S. Lacoste-Julien, F. Berkenkamp, T. Maharaj, K. Wagstaff, and J. Zhu, Eds., vol. 267. PML...

work page 2025
[37]

SummEval: Re-evaluating summarization evaluation,

A. R. Fabbri, W. Kryściński, B. McCann, C. Xiong, R. Socher, and D. Radev, “SummEval: Re-evaluating summarization evaluation,” Transactions of the Association for Computational Linguistics, vol. 9, pp. 391–409, 2021

work page 2021
[38]

Cultural bias and cultural alignment of large language models,

Y. Tao, O. Viberg, R. S. Baker, and R. F. Kizilcec, “Cultural bias and cultural alignment of large language models,” PNAS Nexus, vol. 3, no. 9, p. pgae346, Sep. 2024

work page 2024
[39]

MAGLIC the maghrebi language identification corpus,

K. Jones, K. Walker, C. Caruso, and S. Strassel, “MAGLIC the maghrebi language identification corpus,” in Proceedings of the Speaker and Language Recognition Workshop Odyssey 2024, 2024, pp. 86–90

work page 2024
[40]

ZAEBUC-Spoken a multilingual multidialectal arabic-english speech corpus,

I. Hamed, F. Eryani, D. Palfreyman, and N. Habash, “ZAEBUC-Spoken a multilingual multidialectal arabic-english speech corpus,” in Proceedings of LREC-COLING 2024. ELRA Language Resource Association, 2024, pp. 17770–17782

work page 2024
[41]

Survey article: Inter-coder agreement for computational linguistics,

R. Artstein and M. Poesio, “Survey article: Inter-coder agreement for computational linguistics,” Computational linguistics, vol. 34, no. 4, pp. 555–596, 2008

work page 2008
[42]

Detecting ambiguous utterances in an intelligent assistant,

S. Akasaki and M. Sassano, “Detecting ambiguous utterances in an intelligent assistant,” in Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing Industry Track. Association for Computational Linguistics, 2024, pp. 386–394

work page 2024
[43]

Out-of-scope intent detection with self-supervision and discriminative training,

L.-M. Zhan, H. Liang, B. Liu, L. Fan, X.-M. Wu, and A. Y. S. Lam, “Out-of-scope intent detection with self-supervision and discriminative training,” in Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing. Association for Computational Linguisti...

work page 2021
[44]

Out-of-domain intent detection considering multi-turn dialogue contexts,

H. Lang, Y. Zheng, B. Hui, F. Huang, and Y. Li, “Out-of-domain intent detection considering multi-turn dialogue contexts,” in Proceedings of LREC-COLING 2024. ELRA Language Resource Association, 2024, pp. 12539–12552

work page 2024
[45]

The iso standard for dialogue act annotation, second edition,

H. Bunt, V. Petukhova, E. Gilmartin, C. Pelachaud, A. Fang, S. Keizer, and L. Prevot, “The iso standard for dialogue act annotation, second edition,” in Proceedings of the 12th LREC, 2020, pp. 549–558

work page 2020
[46]

Computing inter-rater reliability and its variance in the presence of high agreement,

K. L. Gwet, “Computing inter-rater reliability and its variance in the presence of high agreement,” British Journal of Mathematical and Statistical Psychology, vol. 61, no. 1, pp. 29–48, 2008

work page 2008
[47]

Cross-lingual acoustic modeling for dialectal Arabic speech recognition,

M. Elmahdy, R. Gruhn, W. Minker, and S. Abdennadher, “Cross-lingual acoustic modeling for dialectal Arabic speech recognition,” in Interspeech 2010, 2010, pp. 873–876

work page 2010
[48] S. A. Chowdhury, A. Hussein, A. Abdelali, and A. Ali, “Towards One Model to Rule All: Multilingual Strategy for Dialectal Code-Switching Arabic ASR,” in Interspeech 2021, 2021, pp. 2466–2470.

[49] L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. Wainwright, P. Mishkin, C. Zhang, S. Agarwal, K. Slama, A. Ray, J. Schulman, J. Hilton, F. Kelton, L. Miller, M. Simens, A. Askell, P. Welinder, P. F. Christiano, J. Leike, and R. Lowe, “Training language models to follow instructions with human feedback,” in Advances in Neural Information Processing Systems, v..., 2022.

[50] S. Lin, J. Hilton, and O. Evans, “TruthfulQA: Measuring how models mimic human falsehoods,” in Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), S. Muresan, P. Nakov, and A. Villavicencio, Eds. Dublin, Ireland: Association for Computational Linguistics, May 2022, pp. 3214–3252.

[51] S. Gehman, S. Gururangan, M. Sap, Y. Choi, and N. A. Smith, “RealToxicityPrompts: Evaluating neural toxic degeneration in language models,” in Findings of the Association for Computational Linguistics: EMNLP 2020, T. Cohn, Y. He, and Y. Liu, Eds. Online: Association for Computational Linguistics, Nov. 2020, pp. 3356–3369.

[52] A. Gatt and E. J. Krahmer, “Survey of the state of the art in natural language generation: Core tasks, applications and evaluation,” Journal of Artificial Intelligence Research, vol. 61, no. 1, pp. 65–170, 2018.

[53] T. Naous, M. J. Ryan, A. Ritter, and W. Xu, “Having beer after prayer? Measuring cultural bias in large language models,” in Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), L.-W. Ku, A. Martins, and V. Srikumar, Eds. Bangkok, Thailand: Association for Computational Linguistics, Aug. 20..., 2024.

[54] F. Alwajih, A. El Mekki, H. Mubarak, M. Hawasly, A. Mohamed, and M. Abdul-Mageed, “PalmX 2025: The first shared task on benchmarking LLMs on Arabic and Islamic culture,” in Proceedings of The Third Arabic Natural Language Processing Conference: Shared Tasks, K. Darwish, A. Ali, I. Abu Farha, S. Touileb, I. Zitouni, A. Abdelali, S. Al-Ghamdi, S. Alkher..., 2025.

[55] F. Alam, M. A. Hasan, and S. A. Chowdhury, “SpokenNativQA: Multilingual everyday spoken queries for LLMs,” in Proceedings of the 26th Interspeech Conference (Interspeech 2025). Rotterdam, The Netherlands: ISCA, Aug. 2025.

[56] J. Xu, Z. Guo, J. He, H. Hu, T. He, S. Bai, K. Chen, J. Wang, Y. Fan, K. Dang, B. Zhang, X. Wang, Y. Chu, and J. Lin, “Qwen2.5-Omni technical report,” 2025.

[57] A. Singh, A. Fry, A. Perelman, A. Tart, A. Ganesh, A. El-Kishky, A. McLaughlin, A. Low, A. Ostrow, A. Ananthram et al., “OpenAI GPT-5 system card,” arXiv preprint arXiv:2601.03267, 2025.

[58] OpenAI, “GPT-4 technical report,” OpenAI, Tech. Rep., 2023.

[59] G. Starace, O. Jaffe, D. Sherburn, J. Aung, J. S. Chan, L. Maksin, R. Dias, E. Mays, B. Kinsella, W. Thompson et al., “PaperBench: Evaluating AI’s ability to replicate AI research,” in ICML. PMLR, 2025, pp. 56 843–56 873.

[60] X. Guo, U. Tyagi, A. Gosai, P. Vergara, J. Park, E. G. H. Montoya, C. B. C. Zhang, B. Hu, Y. He, B. Liu et al., “Beyond seeing: Evaluating multimodal LLMs on tool-enabled image perception, transformation, and reasoning,” arXiv preprint arXiv:2510.12712, 2025.

[61] Appendix

8.1. Prompts

8.1.1. Judge System Prompt for Evaluating Transcription-based Queries

You are a STRICT evaluator assessing whether an AI assistant truly understood the user’s intent and produced a high-quality, grounded response. You will receive:
- user_query: the user’s original query (may be in Arabic dialect or English). This can be a question,...

[1] Introduction

Large language models (LLMs) are increasingly embedded in everyday applications, supporting both text and speech interaction and enabling open-domain conversational assistants beyond intent–slot pipelines [1, 2]. In many practical systems, speech interaction is implemented as a cascade in which automatic speech recognition (ASR) first c...

[2] Related Work

2.1. Interaction Datasets

Large-scale logs of human–assistant interactions have enabled empirical analysis of failure modes and preference learning for text-based assistants. WildChat collects one million real ChatGPT interaction logs [12]. Chatbot Arena provides pairwise human preferences and an Elo-style ranking framework for LLM evaluati...

[3] Datasets

3.1. Data Collection

Figure 1 presents the development process of the WASIL dataset. For data collection, we recruited 93 users to interact with an Arabic-centric ASR → LLM system. For both tasks, we used the publicly available Fanar APIs [22]. The same user recordings were also processed with an alternative pipeline that uses Gemini [23] for both A...
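Because each recording is transcribed by two ASR pipelines, the places where the two hypotheses disagree are a natural signal for where a post-editor should look first. The following is a minimal sketch of that idea, not the authors' pipeline: it assumes whitespace tokenization and uses Python's standard `difflib`; the function names and the toy English strings are hypothetical stand-ins.

```python
from difflib import SequenceMatcher


def agreement_ratio(hyp_a: str, hyp_b: str) -> float:
    """Token-level similarity in [0, 1] between two ASR hypotheses."""
    a, b = hyp_a.split(), hyp_b.split()
    if not a and not b:
        return 1.0  # two empty hypotheses agree trivially
    return SequenceMatcher(a=a, b=b, autojunk=False).ratio()


def disagreement_spans(hyp_a: str, hyp_b: str):
    """Return (tag, tokens_a, tokens_b) for each span where the hypotheses differ."""
    a, b = hyp_a.split(), hyp_b.split()
    matcher = SequenceMatcher(a=a, b=b, autojunk=False)
    return [
        (tag, a[i1:i2], b[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


# Toy strings standing in for two ASR outputs of the same recording (assumption).
hyp_first = "what is the weather today"
hyp_second = "what is the whether today"
print(agreement_ratio(hyp_first, hyp_second))
print(disagreement_spans(hyp_first, hyp_second))
```

Turns with a high agreement ratio could be sent for a quick confirmation pass, while turns with many disagreement spans would receive closer editing; any threshold for that split is a design choice the excerpt does not specify.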

[4] Experiments and Results

4.1. Experimental Setup

We benchmark both open and closed models under multiple query input variations: (i) transcripts, comparing ASR output against gold transcripts, and (ii) raw audio. For ASR, as noted earlier, we use Fanar Aura and Gemini, since both have shown competitive performance for Arabic in prior work [49]. This setup allow...

[5] Discussion

5.1. Effect of Input Modality and Transcript Quality

Table 6 details Gemini’s performance across different input conditions and rubric dimensions. We observe a consistent improvement in overall performance as input quality transitions from direct audio to ASR transcripts, and finally to gold transcripts. When reasoning directly from audio, ...
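Comparing input conditions across rubric dimensions amounts to averaging judge scores within each (condition, dimension) cell. The sketch below illustrates that aggregation under stated assumptions: scores are stored one row per judge per turn, and the field names, the `intent` dimension, the judge labels, and the numeric values are hypothetical placeholders, not results from the paper.

```python
from collections import defaultdict
from statistics import mean


def aggregate_judge_scores(rows):
    """Average scores over judges and turns.

    rows: iterable of dicts with keys 'condition', 'dimension', 'judge', 'score'.
    Returns {condition: {dimension: mean_score}}.
    """
    buckets = defaultdict(lambda: defaultdict(list))
    for row in rows:
        buckets[row["condition"]][row["dimension"]].append(row["score"])
    return {
        cond: {dim: mean(vals) for dim, vals in dims.items()}
        for cond, dims in buckets.items()
    }


# Placeholder values for illustration only; these are not reported results.
toy_rows = [
    {"condition": "asr", "dimension": "intent", "judge": "judge_a", "score": 3},
    {"condition": "asr", "dimension": "intent", "judge": "judge_b", "score": 4},
    {"condition": "gold", "dimension": "intent", "judge": "judge_a", "score": 5},
    {"condition": "gold", "dimension": "intent", "judge": "judge_b", "score": 4},
]
summary = aggregate_judge_scores(toy_rows)
print(summary)
```

Keeping the judge identifier in each row, even though the average ignores it, leaves room to inspect per-judge disagreement later without changing the data layout.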

[6] Conclusion

In this paper, we introduced WASIL, to our knowledge the first in-the-wild dataset of Arabic spoken interactions with LLMs, designed to capture realistic conversational conditions under dialect variation and speech-driven input noise. The dataset includes post-edited transcriptions, user feedback (like and dislike, with fine-grained categ...

[7] K. Wang, H. Ren, Z. Lu, M. Zhan, and H. Li, “Voiceassistant-eval: Benchmarking AI assistants across listening, speaking, and viewing,” arXiv preprint arXiv:2509.22651, 2025.

[8] Y. Hou, H. Liu, Y. Wang, Z. Cheng, R. Wu, Q. Gu, Y. Wang, and Y. Wang, “SOVA-Bench: Benchmarking the Speech Conversation Ability for LLM-based Voice Assistant,” in Interspeech 2025, 2025, pp. 5713–5717.

[9] J. Billa, “The cascade equivalence hypothesis: When do speech LLMs behave like ASR → LLM pipelines?” arXiv preprint arXiv:2602.17598, 2026.

[10] M. Kubis, P. Skórzewski, M. Sowański, and T. Zietkiewicz, “Back transcription as a method for evaluating robustness of natural language understanding models to speech recognition errors,” in Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing. Singapore: Association for Computational Linguistics, Dec. 2023, pp....

[11] M. Galbraith, “An analysis of dialogue repair in voice assistants,” arXiv preprint arXiv:2311.03952, 2024.

[12] H. Men, Y. Hu, Y. He, Y. Gao, X. Mou, and Y. Xu, “Reject or not?: A benchmark for voice assistant query rejection in smart home scenario and an improved method based on LLMs,” arXiv preprint arXiv:2512.10257, 2025.

[13] Y. Chen, X. Yue, C. Zhang, X. Gao, R. T. Tan, and H. Li, “VoiceBench: Benchmarking LLM-based voice assistants,” Transactions of the Association for Computational Linguistics, vol. 14, pp. 378–398, 2026.

[14] S. Kim, A. Arora, D. Le, C.-F. Yeh, C. Fuegen, O. Kalinli, and M. L. Seltzer, “Semantic Distance: A New Metric for ASR Performance Analysis Towards Spoken Language Understanding,” in Interspeech 2021, 2021, pp. 1977–1981.

[15] J. Harvill, R. Khaziev, S. Li, and R. Cogill, “Significant ASR error detection for conversational voice assistants,” in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2024.

[16] Z. Liu, S. Kim, and O. Kalinli, “Evaluating Speech Recognition Performance Towards Large Language Model Based Voice Assistants,” in Interspeech 2024, 2024, pp. 4099–4103.

[17] B. Talafha, K. Kadaoui, S. M. Magdy, M. Habiboullah, C. M. Chafei, A. O. El-Shangiti, H. Zayed, M. C. Tourad, R. Alhamouri, R. Assi, A. Alraeesi, H. Mohamed, F. Alwajih, A. Mohamed, A. El Mekki, E. M. B. Nagoudi, B. D. M. Saadia, H. A. Alsayadi, W. Al-Dhabyani, S. Shatnawi, Y. Ech-chammakhy, A. Makouar, Y. Berrachedi, M. Jarrar, S. Shehata, I. Berrada, ... “Casablanca: Data and models for multidialectal Arabic speech recognition,” 2024.

[18] W. Zhao, X. Ren, J. Hessel, C. Cardie, Y. Choi, and Y. Deng, “WildChat: 1M ChatGPT interaction logs in the wild,” in The Twelfth International Conference on Learning Representations, 2024.

[19] W.-L. Chiang, L. Zheng, Y. Sheng, A. N. Angelopoulos, T. Li, D. Li, B. Zhu, H. Zhang, M. Jordan, J. E. Gonzalez et al., “Chatbot arena: An open platform for evaluating LLMs by human preference,” in International Conference on Machine Learning. PMLR, 2024, pp. 8359–8388.

[20] L. Zheng, W.-L. Chiang, Y. Sheng, S. Zhuang, Z. Wu, Y. Zhuang, Z. Lin, Z. Li, D. Li, E. P. Xing, H. Zhang, J. E. Gonzalez, and I. Stoica, “Judging LLM-as-a-judge with MT-bench and chatbot arena,” in Proceedings of the 37th International Conference on Neural Information Processing Systems, 2023, pp. 46 595–46 623.

[21] D. Bohus and A. I. Rudnicky, “Sorry, I didn’t catch that! – an investigation of non-understanding errors and recovery strategies,” in Proceedings of SIGDIAL 2005, 2005.

[22] G. Tur, A. Deoras, and D. Hakkani-Tur, “Detecting out-of-domain utterances addressed to a virtual personal assistant,” in Proceedings of Interspeech 2014, 2014.

[23] H. A. Rahmani, X. Wang, Y. Feng, Q. Zhang, E. Yilmaz, and A. Lipani, “A survey on asking clarification questions datasets in conversational systems,” in Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), A. Rogers, J. Boyd-Graber, and N. Okazaki, Eds. Toronto, Canada: Association for Computat..., 2023.

[24] J. G. Fiscus, “A post-processing system to yield reduced word error rates: Recognizer output voting error reduction (ROVER),” in Proceedings of the 1997 IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU), 1997.

[25] A. Ali, P. Bell, and S. Renals, “Multi-reference evaluation for dialectal speech recognition system,” in Proceedings of the 4th Workshop on Arabic Natural Language Processing, 2015.

[26] S. Wray, H. Mubarak, and A. Ali, “Best practices for crowdsourcing dialectal Arabic speech transcription,” in Proceedings of the 4th Workshop on Arabic Natural Language Processing, 2015.

[27] J. Prakash, B. Kumar, K. Hacioglu, B. Sharma, S. Gopalan, M. Chetlur, S. Venkatesan, and A. Stolcke, “Better pseudo-labeling with multi-ASR fusion and error correction by SpeechLLM,” in Interspeech 2025, 2025.

[28] F. Team, U. Abbas, M. S. Ahmad, F. Alam, E. Altinisik, E. Asgari, Y. Boshmaf, S. Boughorbel, S. Chawla, S. Chowdhury et al., “Fanar: An Arabic-centric multimodal generative AI platform,” arXiv:2501.13944, 2025.

[29] G. Team, R. Anil, S. Borgeaud, J.-B. Alayrac, J. Yu, R. Soricut, J. Schalkwyk, A. M. Dai, A. Hauth, K. Millican et al., “Gemini: A family of highly capable multimodal models,” arXiv preprint arXiv:2312.11805, 2023.

[30] M. S. Bari, Y. Alnumay, N. A. Alzahrani, N. M. Alotaibi, H. A. Alyahya, S. AlRashed, F. A. Mirza, S. Z. Alsubaie, H. A. Alahmed, G. Alabduljabbar, R. Alkhathran, Y. Almushayqih, R. Alnajim, S. Alsubaihi, M. A. Mansour, S. A. Hassan, D. M. Alrubaian, A. Alammari, Z. Alawami, A. Al-Thubaity, A. Abdelali, J. Kuriakose, A. Abujabal, N. Al-Twairesh, A. Alo... “ALLam: Large language models for Arabic and English,” 2025.

[31] P. Liang, R. Bommasani, T. Lee, D. Tsipras, D. Soylu, M. Yasunaga, Y. Zhang, D. Narayanan, Y. Wu, A. Kumar, B. Newman, B. Yuan, B. Yan, C. Zhang, C. Cosgrove, C. D. Manning, C. Re, D. Acosta-Navas, D. A. Hudson, E. Zelikman et al., “Holistic evaluation of language models,” Transactions on Machine Learning Research, Aug. 2023, accepted by TMLR (OpenReview).

[32] L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. L. Wainwright, P. Mishkin, C. Zhang, S. Agarwal, K. Slama, A. Ray, J. Schulman, J. Hilton, F. Kelton, L. Miller, M. Simens, A. Askell, P. Welinder, P. F. Christiano, J. Leike, and R. Lowe, “Training language models to follow instructions with human feedback,” in Advances in Neural Information Processing Systems..., 2022.

[33] J. Zhou, T. Lu, S. Mishra, S. Brahma, S. Basu, Y. Luan, D. Zhou, and L. Hou, “Instruction-following evaluation for large language models,” arXiv preprint arXiv:2311.07911, 2023.

[34] S. Lin, J. Hilton, and O. Evans, “TruthfulQA: Measuring how models mimic human falsehoods,” in Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Dublin, Ireland: Association for Computational Linguistics, May 2022, pp. 3214–3252.

[35] Y. Bai, S. Kadavath, S. Kundu, A. Askell, J. Kernion, A. Jones, A. Chen, A. Goldie, A. Mirhoseini, C. McKinnon, C. Chen, C. Olsson, C. Olah, D. Hernandez, D. Drain, D. Ganguli, D. Li, E. Tran-Johnson, E. Perez et al., “Constitutional AI: Harmlessness from AI feedback,” arXiv preprint arXiv:2212.08073, Dec. 2022.

[36] J. Cui, W.-L. Chiang, I. Stoica, and C.-J. Hsieh, “OR-bench: An over-refusal benchmark for large language models,” in Proceedings of the 42nd International Conference on Machine Learning, ser. Proceedings of Machine Learning Research, A. Singh, M. Fazel, D. Hsu, S. Lacoste-Julien, F. Berkenkamp, T. Maharaj, K. Wagstaff, and J. Zhu, Eds., vol. 267. PML..., 2025.

[37] A. R. Fabbri, W. Kryściński, B. McCann, C. Xiong, R. Socher, and D. Radev, “SummEval: Re-evaluating summarization evaluation,” Transactions of the Association for Computational Linguistics, vol. 9, pp. 391–409, 2021.

[38] Y. Tao, O. Viberg, R. S. Baker, and R. F. Kizilcec, “Cultural bias and cultural alignment of large language models,” PNAS Nexus, vol. 3, no. 9, p. pgae346, Sep. 2024.

[39] K. Jones, K. Walker, C. Caruso, and S. Strassel, “MAGLIC the Maghrebi language identification corpus,” in Proceedings of the Speaker and Language Recognition Workshop Odyssey 2024, 2024, pp. 86–90.

[40] I. Hamed, F. Eryani, D. Palfreyman, and N. Habash, “ZAEBUC-Spoken a multilingual multidialectal Arabic-English speech corpus,” in Proceedings of LREC-COLING 2024. ELRA Language Resource Association, 2024, pp. 17 770–17 782.
