
arxiv: 2605.16364 · v1 · pith:AIL4AKRXnew · submitted 2026-05-09 · cs.SD · cs.AI · cs.CL

WASIL: In-the-Wild Arabic Spoken Interactions with LLMs

Pith reviewed 2026-05-20 23:11 UTC · model grok-4.3

classification cs.SD · cs.AI · cs.CL
keywords Arabic spoken interactions · LLM voice assistants · ASR errors · dataset release · answerability annotation · reference-free evaluation · dialectal Arabic

The pith

The WASIL dataset captures real Arabic spoken interactions with LLMs to isolate speech recognition errors from other causes of user dissatisfaction.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper releases a dataset of actual user voice interactions with Arabic LLM assistants that includes the audio, ASR output, the assistant's response, and the user's like or dislike rating. The collection has 8,529 turns and a 2,000-turn test set that covers standard Arabic as well as four major dialects. By using agreement among several ASR systems to create low-cost accurate transcripts and by labeling whether each turn is answerable or not, the authors separate cases where the system simply misheard the user from cases where the query was ambiguous, unsupported, or not a request at all. This separation matters because it lets developers see how much of the dissatisfaction comes from the speech recognition step rather than from the language model itself. The paper also describes a way to evaluate the quality of responses using multiple LLM judges without needing a gold standard answer.

Core claim

WASIL provides 8,529 turns of in-the-wild Arabic spoken LLM interactions with audio, ASR hypotheses, responses, and explicit feedback, plus a 2,000-turn test set labeled for MSA and four dialects. Low-cost gold transcripts are created through multi-ASR agreement-guided post-editing, and answerability is annotated to distinguish intrinsic unanswerability from ASR-induced degradation. Scalable reference-free evaluation is outlined using multi-judge LLM scoring for responses based on ASR versus gold transcripts.

What carries the argument

The WASIL dataset, multi-ASR agreement-guided post-editing for gold transcripts, and answerability annotations that isolate ASR effects.

If this is right

  • The feedback and labels allow direct measurement of ASR error impact on user satisfaction in Arabic voice assistants.
  • The dialect-labeled test set supports evaluation across different Arabic varieties.
  • Multi-judge LLM scoring provides a scalable way to compare ASR and gold transcript performance.
  • Answerability categories help exclude non-request turns from quality assessments.
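The last point can be sketched in a few lines of Python. This is a minimal illustration, assuming each turn carries one of the four answerability labels named in the abstract and a like/dislike flag; the `Turn` record and its field names are hypothetical and do not reflect the released file format.

```python
from collections import Counter
from dataclasses import dataclass

ANSWERABLE = "answerable"  # one of the four labels listed in the abstract


@dataclass
class Turn:
    turn_id: str        # hypothetical identifier
    answerability: str  # e.g. "answerable", "unsupported", "not-a-request/noise"
    feedback: str       # "like" or "dislike"


def dislike_rate_by_answerability(turns):
    """Fraction of disliked turns within each answerability label."""
    totals = Counter(t.answerability for t in turns)
    dislikes = Counter(t.answerability for t in turns if t.feedback == "dislike")
    return {label: dislikes[label] / totals[label] for label in totals}


def answerable_only(turns):
    """Keep only turns a system could reasonably be expected to answer."""
    return [t for t in turns if t.answerability == ANSWERABLE]


turns = [
    Turn("t1", "answerable", "like"),
    Turn("t2", "answerable", "dislike"),
    Turn("t3", "not-a-request/noise", "dislike"),
    Turn("t4", "ambiguous/needs-clarification", "like"),
]
rates = dislike_rate_by_answerability(turns)
kept = answerable_only(turns)
```

Restricting quality assessment to `kept` keeps noise and non-request turns from being counted as ASR or LLM failures.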

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • This approach to low-cost transcription could support dataset creation for other languages with limited resources.
  • Developers might use the evaluation method to rapidly iterate on ASR improvements for voice LLMs.
  • The dataset could reveal patterns in how dialects affect interaction success rates.

Load-bearing premise

The multi-ASR agreement post-editing process produces transcripts that are accurate enough to distinguish ASR mistakes from intrinsic query problems.
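One way to picture agreement-guided triage is the sketch below. The figure captions reproduced further down say that agreement is measured as cosine similarity between Fanar and Gemini transcriptions, but this page does not say which text representation is used. The character-trigram vectors and the 0.8 threshold here are assumptions made purely for illustration.

```python
import math
from collections import Counter


def char_ngrams(text, n=3):
    """Bag of character n-grams after whitespace normalisation (assumed representation)."""
    text = " ".join(text.split())
    return Counter(text[i:i + n] for i in range(max(len(text) - n + 1, 1)))


def cosine_similarity(hyp_a, hyp_b):
    """Cosine similarity between the n-gram count vectors of two transcripts."""
    va, vb = char_ngrams(hyp_a), char_ngrams(hyp_b)
    dot = sum(va[g] * vb[g] for g in va.keys() & vb.keys())
    norm = math.sqrt(sum(c * c for c in va.values())) * math.sqrt(
        sum(c * c for c in vb.values())
    )
    return dot / norm if norm else 0.0


def needs_close_review(hyp_a, hyp_b, threshold=0.8):
    """Flag a turn for careful post-editing when the two ASR systems disagree.

    The threshold is a hypothetical value, not one reported in the paper.
    """
    return cosine_similarity(hyp_a, hyp_b) < threshold
```

Under this scheme, turns where both systems agree closely get a light check, and annotator effort goes to the low-similarity turns.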

What would settle it

Independent human transcription of a portion of the data shows significant errors in the gold transcripts, or the LLM judge ratings fail to correlate with human ratings of the responses.
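The first check amounts to computing word error rate (WER) and character error rate (CER) between an independent human transcript and the released gold transcript. A minimal sketch using standard Levenshtein distance follows; the function names are illustrative, and any Arabic text normalisation (diacritics, letter variants) is left out and would need to be decided separately.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (words or characters)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: word-level edits divided by reference length."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference transcript is empty")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate, ignoring spaces."""
    ref_chars = reference.replace(" ", "")
    if not ref_chars:
        raise ValueError("reference transcript is empty")
    return edit_distance(ref_chars, hypothesis.replace(" ", "")) / len(ref_chars)
```

For example, `wer("a b c d", "a x c")` counts one substitution and one deletion against four reference words.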

Figures

Figures reproduced from arXiv: 2605.16364 by Firoj Alam, Hamdy Mubarak, Hunzalah Hassan Bhatti, Shammur Absar Chowdhury, Soon-Gyo Jung, Zien Sheikh Ali.

Figure 1
Figure 1: WASIL dataset development process, from multinational spoken prompts collection and cascaded model inference to multi-layer human annotation. Participants were recruited from four Arab countries: Algeria, Egypt, Sudan, and Syria. Data were collected over nine days using daily topic prompts that encouraged diverse interactions, including open discussion (any topic), follow-up questions, creative writing…
Figure 2
Figure 2: The distribution of cosine similarity scores between Fanar and Gemini transcriptions on the whole dataset (9,304). To further validate this hypothesis, we conduct an additional analysis using the gold transcriptions. We compute word-level (WER) and character-level (CER) edit distances between the gold transcriptions and the Fanar ASR outputs to quantify the actual post-editing effort required. Figures 3 a…
Figure 3
Figure 3: Cosine similarity between Fanar and Gemini transcriptions for post-edited utterances, split into higher CER cases that required more edits (left) and lower CER cases that required fewer edits (right). 3.6.6. ASR Errors vs. Response Dislikeness: We investigate the potential reasons behind dislike reactions by examining the relationship between ASR similarity scores and user feedback…
Figure 4
Figure 4: Cosine similarity between Fanar and Gemini transcriptions for post-edited utterances, split into higher WER cases that required more edits (left) and lower WER cases that required fewer edits (right).
[Plot residue: histograms of frequency against cosine similarity (0.0 to 1.0); one panel is titled "Prompts With Like Response (7,482)".]
Figure 5
Figure 5: Distribution of cosine similarity scores between Fanar and Gemini transcriptions for spoken prompts categorized by like (left) and dislike (right) reactions. … we provide the gold transcription to Fanar, which results in the correct answer. We also present an example of a prompt without transcription errors that still received a dislike reaction (Figure 6c). In this case, the prompt is incomplete and lacks…
Figure 7
Figure 7: Model performance metrics. (a) Direct audio input performance across models. (b) The impact of transcript quality on Average Pass Rate (APR) for cascaded models. … compute the fraction of satisfied criteria:
$$ s_i = \frac{1}{N_i} \sum_{j=1}^{N_i} \mathbf{1}_{r_{ij}}. \tag{3} $$
Equivalently, the overall ARS is the average of these per-query pass rates across the entire test set:
$$ \mathrm{ARS} = \frac{1}{N} \sum_{i=1}^{N} s_i. \tag{4} $$
4.3. Results …
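The excerpt attached to Figure 7 defines a per-query score as the fraction of rubric criteria a response satisfies, and an overall score as the mean of those per-query scores over the test set. A minimal Python sketch of that arithmetic, assuming each query's judged criteria arrive as a list of booleans (the function names and input layout are illustrative, not taken from the paper):

```python
def per_query_pass_rate(criteria_results):
    """Fraction of satisfied criteria for one query (s_i in the excerpt)."""
    if not criteria_results:
        raise ValueError("a query needs at least one criterion")
    return sum(1 for passed in criteria_results if passed) / len(criteria_results)


def overall_pass_rate(all_results):
    """Mean of the per-query pass rates across the test set."""
    if not all_results:
        raise ValueError("the test set is empty")
    scores = [per_query_pass_rate(results) for results in all_results]
    return sum(scores) / len(scores)


# Two queries: the first satisfies 1 of 2 criteria, the second 4 of 4.
judged = [[True, False], [True, True, True, True]]
overall = overall_pass_rate(judged)
```

Each query contributes equally to the overall score regardless of how many criteria it has, which is what averaging per-query rates (rather than pooling all criteria) implies.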
Original abstract

Large language model (LLM) voice assistants are commonly built as cascaded automatic speech recognition (ASR)-to-LLM systems, where recognition errors can distort user intent. Dislikes may also arise from ambiguous, out-of-domain, or non-request turns, making it hard to isolate ASR effects. We release WASIL (it denotes connection or linking in Arabic): in-the-wild Arabic spoken interaction prompts with audio, ASR hypotheses, assistant responses, and explicit like/dislike feedback (8,529 turns; 14.2% dislikes), plus a 2,000-turn test set covering Modern Standard Arabic (MSA) and four major dialects with their labels. We provide low-cost gold transcripts via multi-ASR agreement-guided post-editing and annotate answerability (answerable, ambiguous/needs-clarification, unsupported, not-a-request/noise) to separate intrinsic unanswerability from ASR-induced degradation. Finally, we describe scalable reference-free evaluation of responses from ASR vs. gold transcripts using multi-judge LLM scoring.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The manuscript introduces WASIL, a dataset of 8,529 in-the-wild Arabic spoken interaction turns including audio, ASR hypotheses, LLM assistant responses, and explicit like/dislike feedback, plus a 2,000-turn test set covering MSA and four major dialects with labels. It describes low-cost gold transcript creation via multi-ASR agreement-guided post-editing, answerability annotation (answerable, ambiguous/needs-clarification, unsupported, not-a-request/noise) to separate intrinsic unanswerability from ASR-induced degradation, and a scalable reference-free evaluation using multi-judge LLM scoring of responses from ASR versus gold transcripts.

Significance. If the post-edited transcripts prove reliable, the dataset would provide a useful resource for diagnosing sources of user dissatisfaction in cascaded ASR-LLM voice systems for Arabic, including dialectal varieties. The release of raw audio, feedback labels, and the proposed reference-free evaluation protocol is a concrete contribution that could support follow-on work on ASR robustness. Credit is given for the dataset release and the practical focus on low-cost transcript generation and scalable evaluation.

major comments (1)
  1. [Abstract and transcript creation description] The multi-ASR agreement-guided post-editing procedure is presented as producing reliable gold transcripts for answerability annotation and reference-free evaluation, yet no WER, edit-distance statistics, or human validation results are reported on any subset of the 8,529 turns or the 2,000-turn test set. This directly affects the central utility claim of separating ASR-induced degradation from intrinsic unanswerability.
minor comments (2)
  1. [Data collection and annotation] Additional details on data collection biases, inter-annotator agreement for answerability labels, and the precise multi-judge LLM scoring protocol (number of judges, aggregation rule) would improve reproducibility.
  2. [Evaluation] Clarify the exact composition and usage of the 2,000-turn test set within the evaluation experiments.
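The first minor comment asks for the number of judges and the aggregation rule, neither of which is stated on this page. To make the question concrete, here is one possible rule, a strict majority vote per rubric criterion. It is an assumption for illustration, not the paper's protocol, and the criterion names are hypothetical.

```python
def majority_vote(verdicts):
    """Return True when strictly more than half of the judges say 'pass'."""
    if not verdicts:
        raise ValueError("at least one judge verdict is required")
    return sum(1 for v in verdicts if v) * 2 > len(verdicts)


def aggregate_criteria(judgments):
    """Map each criterion name to its majority verdict across judges."""
    return {name: majority_vote(votes) for name, votes in judgments.items()}


# Three hypothetical judges scoring two hypothetical criteria.
example = {
    "intent_understood": [True, True, False],
    "grounded": [False, True, False],
}
aggregated = aggregate_criteria(example)
```

With an even number of judges a tie counts as a fail under this rule, which is one reason the judge count matters for reproducibility.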

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback and for recognizing the potential utility of the WASIL dataset, the value of the raw audio and feedback release, and the practical emphasis on low-cost transcript generation and scalable evaluation. We address the single major comment below.

Point-by-point responses
  1. Referee: [Abstract and transcript creation description] The multi-ASR agreement-guided post-editing procedure is presented as producing reliable gold transcripts for answerability annotation and reference-free evaluation, yet no WER, edit-distance statistics, or human validation results are reported on any subset of the 8,529 turns or the 2,000-turn test set. This directly affects the central utility claim of separating ASR-induced degradation from intrinsic unanswerability.

    Authors: We agree that the absence of quantitative validation metrics for the post-edited transcripts weakens the central claim that the procedure reliably separates ASR-induced degradation from intrinsic unanswerability. In the revised manuscript we will add (i) WER and character-level edit-distance statistics comparing the final post-edited transcripts against the original multi-ASR hypotheses on both the full 8,529-turn collection and the 2,000-turn test set, and (ii) human validation results (inter-annotator agreement and error analysis) on a stratified random subset of at least 500 turns. These additions will be placed in a new subsection under “Gold Transcript Creation” and will be referenced from the abstract. revision: yes
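The promised stratified random subset could be drawn along these lines. This is a sketch under stated assumptions: stratification by dialect label and proportional allocation are guesses at what the authors intend, and the helper name and record layout are hypothetical.

```python
import random
from collections import defaultdict


def stratified_sample(items, key, total, seed=0):
    """Draw roughly `total` items, allocated to strata in proportion to their size.

    Rounding means the returned size can differ slightly from `total`.
    """
    rng = random.Random(seed)  # fixed seed so the subset is reproducible
    strata = defaultdict(list)
    for item in items:
        strata[key(item)].append(item)
    sample = []
    for members in strata.values():
        k = min(len(members), round(total * len(members) / len(items)))
        sample.extend(rng.sample(members, k))
    return sample


# Toy data: 100 turns labelled "MSA" and 100 labelled "Egyptian".
turns = [{"id": i, "dialect": "MSA" if i < 100 else "Egyptian"} for i in range(200)]
subset = stratified_sample(turns, key=lambda t: t["dialect"], total=20)
```

Proportional allocation mirrors the dataset's dialect mix; equal allocation per dialect would instead give tighter per-dialect estimates for the smaller varieties.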

Circularity Check

0 steps flagged

No circularity: dataset release and procedural description only

full rationale

The paper releases the WASIL dataset of Arabic spoken interactions with audio, ASR hypotheses, responses, and feedback, then describes a multi-ASR agreement-guided post-editing process for low-cost gold transcripts plus reference-free multi-judge LLM scoring for evaluation. No mathematical derivations, predictions, fitted parameters, or closed-form results are claimed anywhere in the manuscript. All steps are empirical data collection and annotation procedures that stand independently without reducing to self-definitions, self-citations, or renamed inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The contributions rest on established practices in ASR post-editing and LLM-as-judge evaluation; no new free parameters or invented entities are introduced.

axioms (2)
  • domain assumption: Agreement among multiple ASR systems can guide post-editing to produce usable gold transcripts at low cost.
    Invoked to create the low-cost gold transcripts described in the abstract.
  • domain assumption: Multi-judge LLM scoring yields reliable reference-free quality estimates for assistant responses.
    Used to enable the scalable evaluation of ASR versus gold transcript responses.



Reference graph

Works this paper leans on

61 extracted references · 61 canonical work pages · 5 internal anchors

  1. [1]

    Introduction Large language models (LLMs) are increasingly embedded in everyday applications, supporting both text and speech interaction and enabling open-domain conversational assistants beyond intent–slot pipelines [1, 2]. In many practical systems, speech interaction is implemented as a cascade in which automatic speech recognition (ASR) first c...

  2. [2]

    downstream failures

    Related Work 2.1. Interaction Datasets Large-scale logs of human–assistant interactions have enabled empirical analysis of failure modes and preference learning for text-based assistants. WildChat collects one million real ChatGPT interaction logs [12]. Chatbot Arena provides pairwise human preferences and an Elo-style ranking framework for LLM evaluati...

  3. [3]

    religiously wrong

    Datasets 3.1. Data Collection In Figure 1, we present the WASIL dataset development process. For data collection, we recruited 93 users to interact with an Arabic-centric ASR → LLM system. For both tasks, we used the publicly available Fanar APIs [22]. The same user recordings were also processed with an alternative pipeline that uses Gemini [23] for both A...

  4. [4]

    Experimental Setup We benchmark both open and closed models under multiple query input variations, including (i) transcript using ASR vs

    Experiments and Results 4.1. Experimental Setup We benchmark both open and closed models under multiple query input variations, including (i) transcript using ASR vs. gold transcripts, and (ii) raw audio. For ASR, as noted earlier, we use Fanar Aura and Gemini, since both have shown competitive performance for Arabic in prior work [49]. This setup allow...

  5. [5]

    Effect of Input Modality and Transcript Quality Table 6 details Gemini’s performance across different input conditions and rubric dimensions

    Discussion 5.1. Effect of Input Modality and Transcript Quality Table 6 details Gemini’s performance across different input conditions and rubric dimensions. We observe a consistent improvement in overall performance as input quality transitions from direct audio to ASR transcripts, and finally to gold transcripts. When reasoning directly from audio, ...

  6. [6]

    Conclusion In this paper, we introduced WASIL, to our knowledge the first in-the-wild dataset of Arabic spoken interactions with LLMs, designed to capture realistic conversational conditions under dialect variation and speech-driven input noise. The dataset includes post-edited transcriptions, user feedback (like and dislike, with fine-grained categ...

  7. [7]

    VoiceAssistant-Eval: Benchmarking AI assistants across listening, speaking, and viewing,

    K. Wang, H. Ren, Z. Lu, M. Zhan, and H. Li, “VoiceAssistant-Eval: Benchmarking AI assistants across listening, speaking, and viewing,” arXiv preprint arXiv:2509.22651, 2025

  8. [8]

    SOVA-Bench: Benchmarking the Speech Conversation Ability for LLM-based Voice Assistant,

    Y. Hou, H. Liu, Y. Wang, Z. Cheng, R. Wu, Q. Gu, Y. Wang, and Y. Wang, “SOVA-Bench: Benchmarking the Speech Conversation Ability for LLM-based Voice Assistant,” in Interspeech 2025, 2025, pp. 5713–5717

  9. [9]

    The cascade equivalence hypothesis: When do speech LLMs behave like ASR → LLM pipelines?

    J. Billa, “The cascade equivalence hypothesis: When do speech LLMs behave like ASR → LLM pipelines?” arXiv preprint arXiv:2602.17598, 2026

  10. [10]

    Back transcription as a method for evaluating robustness of natural language understanding models to speech recognition errors,

    M. Kubis, P. Skórzewski, M. Sowański, and T. Zietkiewicz, “Back transcription as a method for evaluating robustness of natural language understanding models to speech recognition errors,” in Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing. Singapore: Association for Computational Linguistics, Dec. 2023, pp....

  11. [11]

    An analysis of dialogue repair in voice assistants,

    M. Galbraith, “An analysis of dialogue repair in voice assistants,” arXiv preprint arXiv:2311.03952, 2024

  12. [12]

    Reject or not?: A benchmark for voice assistant query rejection in smart home scenario and an improved method based on llms,

    H. Men, Y. Hu, Y. He, Y. Gao, X. Mou, and Y. Xu, “Reject or not?: A benchmark for voice assistant query rejection in smart home scenario and an improved method based on llms,” arXiv preprint arXiv:2512.10257, 2025

  13. [13]

    VoiceBench: Benchmarking llm-based voice assistants,

    Y. Chen, X. Yue, C. Zhang, X. Gao, R. T. Tan, and H. Li, “VoiceBench: Benchmarking llm-based voice assistants,” Transactions of the Association for Computational Linguistics, vol. 14, pp. 378–398, 2026

  14. [14]

    Semantic Distance: A New Metric for ASR Performance Analysis Towards Spoken Language Understanding,

    S. Kim, A. Arora, D. Le, C.-F. Yeh, C. Fuegen, O. Kalinli, and M. L. Seltzer, “Semantic Distance: A New Metric for ASR Performance Analysis Towards Spoken Language Understanding,” in Interspeech 2021, 2021, pp. 1977–1981

  15. [15]

    Significant ASR error detection for conversational voice assistants,

    J. Harvill, R. Khaziev, S. Li, and R. Cogill, “Significant ASR error detection for conversational voice assistants,” in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2024

  16. [16]

    Evaluating Speech Recognition Performance Towards Large Language Model Based Voice Assistants,

    Z. Liu, S. Kim, and O. Kalinli, “Evaluating Speech Recognition Performance Towards Large Language Model Based Voice Assistants,” in Interspeech 2024, 2024, pp. 4099–4103

  17. [17]

    Casablanca: Data and models for multidialectal Arabic speech recognition,

    B. Talafha, K. Kadaoui, S. M. Magdy, M. Habiboullah, C. M. Chafei, A. O. El-Shangiti, H. Zayed, M. C. Tourad, R. Alhamouri, R. Assi, A. Alraeesi, H. Mohamed, F. Alwajih, A. Mohamed, A. El Mekki, E. M. B. Nagoudi, B. D. M. Saadia, H. A. Alsayadi, W. Al-Dhabyani, S. Shatnawi, Y. Ech-chammakhy, A. Makouar, Y. Berrachedi, M. Jarrar, S. Shehata, I. Berrada, ...

  18. [18]

    WildChat: 1m chatgpt interaction logs in the wild,

    W. Zhao, X. Ren, J. Hessel, C. Cardie, Y. Choi, and Y. Deng, “WildChat: 1m chatgpt interaction logs in the wild,” in The Twelfth International Conference on Learning Representations, 2024

  19. [19]

    Chatbot arena: An open platform for evaluating llms by human preference,

    W.-L. Chiang, L. Zheng, Y. Sheng, A. N. Angelopoulos, T. Li, D. Li, B. Zhu, H. Zhang, M. Jordan, J. E. Gonzalez et al., “Chatbot arena: An open platform for evaluating llms by human preference,” in International Conference on Machine Learning. PMLR, 2024, pp. 8359–8388

  20. [20]

    Judging LLM-as-a-judge with MT-bench and chatbot arena,

    L. Zheng, W.-L. Chiang, Y. Sheng, S. Zhuang, Z. Wu, Y. Zhuang, Z. Lin, Z. Li, D. Li, E. P. Xing, H. Zhang, J. E. Gonzalez, and I. Stoica, “Judging LLM-as-a-judge with MT-bench and chatbot arena,” in Proceedings of the 37th International Conference on Neural Information Processing Systems, 2023, pp. 46595–46623

  21. [21]

    Sorry, i didn’t catch that! – an investigation of non-understanding errors and recovery strategies,

    D. Bohus and A. I. Rudnicky, “Sorry, i didn’t catch that! – an investigation of non-understanding errors and recovery strategies,” in Proceedings of SIGDIAL 2005, 2005

  22. [22]

    Detecting out-of-domain utterances addressed to a virtual personal assistant,

    G. Tur, A. Deoras, and D. Hakkani-Tur, “Detecting out-of-domain utterances addressed to a virtual personal assistant,” in Proceedings of Interspeech 2014, 2014

  23. [23]

    A survey on asking clarification questions datasets in conversational systems,

    H. A. Rahmani, X. Wang, Y. Feng, Q. Zhang, E. Yilmaz, and A. Lipani, “A survey on asking clarification questions datasets in conversational systems,” in Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), A. Rogers, J. Boyd-Graber, and N. Okazaki, Eds. Toronto, Canada: Association for Computat...

  24. [24]

    A post-processing system to yield reduced word error rates: Recognizer output voting error reduction (ROVER),

    J. G. Fiscus, “A post-processing system to yield reduced word error rates: Recognizer output voting error reduction (ROVER),” in Proceedings of the 1997 IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU), 1997

  25. [25]

    Multi-reference evaluation for dialectal speech recognition system,

    A. Ali, P. Bell, and S. Renals, “Multi-reference evaluation for dialectal speech recognition system,” in Proceedings of the 4th Workshop on Arabic Natural Language Processing, 2015

  26. [26]

    Best practices for crowdsourcing dialectal Arabic speech transcription,

    S. Wray, H. Mubarak, and A. Ali, “Best practices for crowdsourcing dialectal Arabic speech transcription,” in Proceedings of the 4th Workshop on Arabic Natural Language Processing, 2015

  27. [27]

    Better pseudo-labeling with multi-asr fusion and error correction by speechllm,

    J. Prakash, B. Kumar, K. Hacioglu, B. Sharma, S. Gopalan, M. Chetlur, S. Venkatesan, and A. Stolcke, “Better pseudo-labeling with multi-asr fusion and error correction by speechllm,” in Interspeech 2025, 2025

  28. [28]

    Fanar: An arabic-centric multimodal generative ai platform,

    F. Team, U. Abbas, M. S. Ahmad, F. Alam, E. Altinisik, E. Asgari, Y. Boshmaf, S. Boughorbel, S. Chawla, S. Chowdhury et al., “Fanar: An arabic-centric multimodal generative ai platform,” arXiv:2501.13944, 2025

  29. [29]

    Gemini: A Family of Highly Capable Multimodal Models

    G. Team, R. Anil, S. Borgeaud, J.-B. Alayrac, J. Yu, R. Soricut, J. Schalkwyk, A. M. Dai, A. Hauth, K. Millican et al., “Gemini: a family of highly capable multimodal models,” arXiv preprint arXiv:2312.11805, 2023

  30. [30]

    ALLam: Large language models for arabic and english,

    M. S. Bari, Y. Alnumay, N. A. Alzahrani, N. M. Alotaibi, H. A. Alyahya, S. AlRashed, F. A. Mirza, S. Z. Alsubaie, H. A. Alahmed, G. Alabduljabbar, R. Alkhathran, Y. Almushayqih, R. Alnajim, S. Alsubaihi, M. A. Mansour, S. A. Hassan, D. M. Alrubaian, A. Alammari, Z. Alawami, A. Al-Thubaity, A. Abdelali, J. Kuriakose, A. Abujabal, N. Al-Twairesh, A. Alo...

  31. [31]

    Holistic evaluation of language models,

    P. Liang, R. Bommasani, T. Lee, D. Tsipras, D. Soylu, M. Yasunaga, Y. Zhang, D. Narayanan, Y. Wu, A. Kumar, B. Newman, B. Yuan, B. Yan, C. Zhang, C. Cosgrove, C. D. Manning, C. Re, D. Acosta-Navas, D. A. Hudson, E. Zelikman et al., “Holistic evaluation of language models,” Transactions on Machine Learning Research, Aug. 2023, accepted by TMLR (OpenReview)

  32. [32]

    Training language models to follow instructions with human feedback,

    L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. L. Wainwright, P. Mishkin, C. Zhang, S. Agarwal, K. Slama, A. Ray, J. Schulman, J. Hilton, F. Kelton, L. Miller, M. Simens, A. Askell, P. Welinder, P. F. Christiano, J. Leike, and R. Lowe, “Training language models to follow instructions with human feedback,” in Advances in Neural Information Processing Systems...

  33. [33]

    Instruction-Following Evaluation for Large Language Models

    J. Zhou, T. Lu, S. Mishra, S. Brahma, S. Basu, Y. Luan, D. Zhou, and L. Hou, “Instruction-following evaluation for large language models,” arXiv preprint arXiv:2311.07911, 2023

  34. [34]

    TruthfulQA: Measuring how models mimic human falsehoods,

    S. Lin, J. Hilton, and O. Evans, “TruthfulQA: Measuring how models mimic human falsehoods,” in Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Dublin, Ireland: Association for Computational Linguistics, May 2022, pp. 3214–3252

  35. [35]

    Constitutional AI: Harmlessness from AI Feedback

    Y. Bai, S. Kadavath, S. Kundu, A. Askell, J. Kernion, A. Jones, A. Chen, A. Goldie, A. Mirhoseini, C. McKinnon, C. Chen, C. Olsson, C. Olah, D. Hernandez, D. Drain, D. Ganguli, D. Li, E. Tran-Johnson, E. Perez et al., “Constitutional AI: Harmlessness from AI feedback,” arXiv preprint arXiv:2212.08073, Dec. 2022

  36. [36]

    OR-bench: An over-refusal benchmark for large language models,

    J. Cui, W.-L. Chiang, I. Stoica, and C.-J. Hsieh, “OR-bench: An over-refusal benchmark for large language models,” in Proceedings of the 42nd International Conference on Machine Learning, ser. Proceedings of Machine Learning Research, A. Singh, M. Fazel, D. Hsu, S. Lacoste-Julien, F. Berkenkamp, T. Maharaj, K. Wagstaff, and J. Zhu, Eds., vol. 267. PML...

  37. [37]

    SummEval: Re-evaluating summarization evaluation,

    A. R. Fabbri, W. Kryściński, B. McCann, C. Xiong, R. Socher, and D. Radev, “SummEval: Re-evaluating summarization evaluation,” Transactions of the Association for Computational Linguistics, vol. 9, pp. 391–409, 2021

  38. [38]

    Cultural bias and cultural alignment of large language models,

    Y. Tao, O. Viberg, R. S. Baker, and R. F. Kizilcec, “Cultural bias and cultural alignment of large language models,” PNAS Nexus, vol. 3, no. 9, p. pgae346, Sep. 2024

  39. [39]

    MAGLIC the maghrebi language identification corpus,

    K. Jones, K. Walker, C. Caruso, and S. Strassel, “MAGLIC the maghrebi language identification corpus,” in Proceedings of the Speaker and Language Recognition Workshop Odyssey 2024, 2024, pp. 86–90

  40. [40]

    ZAEBUC-Spoken a multilingual multidialectal arabic-english speech corpus,

    I. Hamed, F. Eryani, D. Palfreyman, and N. Habash, “ZAEBUC-Spoken a multilingual multidialectal arabic-english speech corpus,” in Proceedings of LREC-COLING 2024. ELRA Language Resource Association, 2024, pp. 17770–17782

  41. [41]

    Survey article: Inter-coder agreement for computational linguistics,

    R. Artstein and M. Poesio, “Survey article: Inter-coder agreement for computational linguistics,” Computational linguistics, vol. 34, no. 4, pp. 555–596, 2008

  42. [42]

    Detecting ambiguous utterances in an intelligent assistant,

    S. Akasaki and M. Sassano, “Detecting ambiguous utterances in an intelligent assistant,” in Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing Industry Track. Association for Computational Linguistics, 2024, pp. 386–394

  43. [43]

    Out-of-scope intent detection with self-supervision and discriminative training,

    L.-M. Zhan, H. Liang, B. Liu, L. Fan, X.-M. Wu, and A. Y. S. Lam, “Out-of-scope intent detection with self-supervision and discriminative training,” in Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing. Association for Computational Linguisti...

  44. [44]

    Out-of-domain intent detection considering multi-turn dialogue contexts,

    H. Lang, Y. Zheng, B. Hui, F. Huang, and Y. Li, “Out-of-domain intent detection considering multi-turn dialogue contexts,” in Proceedings of LREC-COLING 2024. ELRA Language Resource Association, 2024, pp. 12539–12552

  45. [45]

    The iso standard for dialogue act annotation, second edition,

    H. Bunt, V. Petukhova, E. Gilmartin, C. Pelachaud, A. Fang, S. Keizer, and L. Prevot, “The iso standard for dialogue act annotation, second edition,” in Proceedings of the 12th LREC, 2020, pp. 549–558

  46. [46]

    Computing inter-rater reliability and its variance in the presence of high agreement,

    K. L. Gwet, “Computing inter-rater reliability and its variance in the presence of high agreement,” British Journal of Mathematical and Statistical Psychology, vol. 61, no. 1, pp. 29–48, 2008

  47. [47]

    Cross-lingual acoustic modeling for dialectal Arabic speech recognition,

    M. Elmahdy, R. Gruhn, W. Minker, and S. Abdennadher, “Cross-lingual acoustic modeling for dialectal Arabic speech recognition,” in Interspeech 2010, 2010, pp. 873–876

  48. [48]

    Towards One Model to Rule All: Multilingual Strategy for Dialectal Code-Switching Arabic ASR,

    S. A. Chowdhury, A. Hussein, A. Abdelali, and A. Ali, “Towards One Model to Rule All: Multilingual Strategy for Dialectal Code-Switching Arabic ASR,” in Interspeech 2021, 2021, pp. 2466–2470

  49. [49]

    Training language models to follow instructions with human feedback,

    L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. Wainwright, P. Mishkin, C. Zhang, S. Agarwal, K. Slama, A. Ray, J. Schulman, J. Hilton, F. Kelton, L. Miller, M. Simens, A. Askell, P. Welinder, P. F. Christiano, J. Leike, and R. Lowe, “Training language models to follow instructions with human feedback,” in Advances in Neural Information Processing Systems, v...

  50. [50]

    TruthfulQA: Measuring how models mimic human falsehoods,

    S. Lin, J. Hilton, and O. Evans, “TruthfulQA: Measuring how models mimic human falsehoods,” in Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), S. Muresan, P. Nakov, and A. Villavicencio, Eds. Dublin, Ireland: Association for Computational Linguistics, May 2022, pp. 3214–3252

  51. [51]

    RealToxicityPrompts: Evaluating neural toxic degeneration in language models,

    S. Gehman, S. Gururangan, M. Sap, Y. Choi, and N. A. Smith, “RealToxicityPrompts: Evaluating neural toxic degeneration in language models,” in Findings of the Association for Computational Linguistics: EMNLP 2020, T. Cohn, Y. He, and Y. Liu, Eds. Online: Association for Computational Linguistics, Nov. 2020, pp. 3356–3369

  52. [52]

    Survey of the state of the art in natural language generation: Core tasks, applications and evaluation,

    A. Gatt and E. J. Krahmer, “Survey of the state of the art in natural language generation: Core tasks, applications and evaluation,” Journal of Artificial Intelligence Research, vol. 61, no. 1, pp. 65–170, 2018

  53. [53]

    Having beer after prayer? measuring cultural bias in large language models,

    T. Naous, M. J. Ryan, A. Ritter, and W. Xu, “Having beer after prayer? measuring cultural bias in large language models,” in Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), L.-W. Ku, A. Martins, and V. Srikumar, Eds. Bangkok, Thailand: Association for Computational Linguistics, Aug. 20...

  54. [54]

    PalmX 2025: The first shared task on benchmarking LLMs on Arabic and islamic culture,

    F. Alwajih, A. El Mekki, H. Mubarak, M. Hawasly, A. Mohamed, and M. Abdul-Mageed, “PalmX 2025: The first shared task on benchmarking LLMs on Arabic and islamic culture,” in Proceedings of The Third Arabic Natural Language Processing Conference: Shared Tasks, K. Darwish, A. Ali, I. Abu Farha, S. Touileb, I. Zitouni, A. Abdelali, S. Al-Ghamdi, S. Alkher...

  55. [55]

    SpokenNativQA: Multilingual everyday spoken queries for llms,

    F. Alam, M. A. Hasan, and S. A. Chowdhury, “SpokenNativQA: Multilingual everyday spoken queries for llms,” in Proceedings of the 26th Interspeech Conference (Interspeech 2025). Rotterdam, The Netherlands: ISCA, Aug. 2025

  56. [56]

    Qwen2.5-omni technical report,

    J. Xu, Z. Guo, J. He, H. Hu, T. He, S. Bai, K. Chen, J. Wang, Y. Fan, K. Dang, B. Zhang, X. Wang, Y. Chu, and J. Lin, “Qwen2.5-omni technical report,” 2025

  57. [57]

    OpenAI GPT-5 System Card

    A. Singh, A. Fry, A. Perelman, A. Tart, A. Ganesh, A. El-Kishky, A. McLaughlin, A. Low, A. Ostrow, A. Ananthram et al., “OpenAI GPT-5 system card,” arXiv preprint arXiv:2601.03267, 2025

  58. [58]

    GPT-4 technical report,

    OpenAI, “GPT-4 technical report,” OpenAI, Tech. Rep., 2023

  59. [59]

    Paperbench: Evaluating ai’s ability to replicate ai research,

    G. Starace, O. Jaffe, D. Sherburn, J. Aung, J. S. Chan, L. Maksin, R. Dias, E. Mays, B. Kinsella, W. Thompson et al., “Paperbench: Evaluating ai’s ability to replicate ai research,” in ICML. PMLR, 2025, pp. 56843–56873

  60. [60]

    Beyond seeing: Evaluating multimodal llms on tool-enabled image perception, transformation, and reasoning,

    X. Guo, U. Tyagi, A. Gosai, P. Vergara, J. Park, E. G. H. Montoya, C. B. C. Zhang, B. Hu, Y. He, B. Liu et al., “Beyond seeing: Evaluating multimodal llms on tool-enabled image perception, transformation, and reasoning,” arXiv preprint arXiv:2510.12712, 2025

  61. [61]

    summarize IT

    Appendix 8.1. PROMPTS 8.1.1. Judge System Prompt for evaluating Transcription-based queries. You are a STRICT evaluator assessing whether an AI assistant truly understood the user’s intent and produced a high-quality, grounded response. You will receive: - user_query: the user’s original query (may be in Arabic dialect or English). This can be a question,...