WASIL: In-the-Wild Arabic Spoken Interactions with LLMs
Pith reviewed 2026-05-20 23:11 UTC · model grok-4.3
The pith
The WASIL dataset captures real Arabic spoken interactions with LLMs to isolate speech recognition errors from other causes of user dissatisfaction.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
WASIL provides 8,529 turns of in-the-wild Arabic spoken LLM interactions with audio, ASR hypotheses, responses, and explicit feedback, plus a 2,000-turn test set labeled for MSA and four dialects. Low-cost gold transcripts are created through multi-ASR agreement-guided post-editing, and answerability is annotated to distinguish intrinsic unanswerability from ASR-induced degradation. Scalable reference-free evaluation is outlined using multi-judge LLM scoring for responses based on ASR versus gold transcripts.
What carries the argument
The WASIL dataset, multi-ASR agreement-guided post-editing for gold transcripts, and answerability annotations that isolate ASR effects.
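To make the agreement-guided idea concrete, here is a minimal Python sketch of how disagreement among ASR systems could decide which turns are sent for human post-editing. The paper's actual procedure is not described in this summary, so the function names, the word-level similarity measure (`difflib.SequenceMatcher`), and the 0.9 threshold are all illustrative assumptions.

```python
from difflib import SequenceMatcher
from itertools import combinations


def pairwise_agreement(hyp_a: str, hyp_b: str) -> float:
    """Word-level similarity ratio between two ASR hypotheses (0.0 to 1.0)."""
    return SequenceMatcher(None, hyp_a.split(), hyp_b.split()).ratio()


def agreement_score(hypotheses: list[str]) -> float:
    """Mean pairwise agreement across all ASR hypotheses for one turn."""
    pairs = list(combinations(hypotheses, 2))
    if not pairs:
        return 1.0  # a single hypothesis trivially agrees with itself
    return sum(pairwise_agreement(a, b) for a, b in pairs) / len(pairs)


def needs_post_editing(hypotheses: list[str], threshold: float = 0.9) -> bool:
    """Flag a turn for human post-editing when the systems disagree.

    The threshold is a hypothetical value, not one taken from the paper.
    """
    return agreement_score(hypotheses) < threshold
```

Under this sketch, turns where all systems agree would be accepted with little or no editing, while low-agreement turns would be routed to an annotator. This is one plausible way to keep transcription cost low.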
If this is right
- The feedback and labels allow direct measurement of ASR error impact on user satisfaction in Arabic voice assistants.
- The dialect-labeled test set supports evaluation across different Arabic varieties.
- Multi-judge LLM scoring provides a scalable way to compare ASR and gold transcript performance.
- Answerability categories help exclude non-request turns from quality assessments.
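The last two points can be combined into a small Python sketch: aggregate several judge scores per response, then measure the gold-minus-ASR score gap on answerable turns only. The record layout (`asr_scores`, `gold_scores`, `answerability`) and the mean aggregation rule are assumptions made for illustration. The summary does not state how many judges are used or how their scores are combined.

```python
from statistics import mean


def aggregate_judges(scores: list[float]) -> float:
    """Combine per-judge scores for one response; the mean is an assumed rule."""
    return mean(scores)


def asr_gap(turns: list[dict]) -> float:
    """Mean (gold - ASR) judge score over answerable turns only.

    Each turn is a dict with hypothetical keys:
      'answerability': one of the four labels named in the abstract
      'asr_scores':    judge scores for the response to the ASR transcript
      'gold_scores':   judge scores for the response to the gold transcript
    """
    gaps = [
        aggregate_judges(t["gold_scores"]) - aggregate_judges(t["asr_scores"])
        for t in turns
        if t["answerability"] == "answerable"
    ]
    if not gaps:
        raise ValueError("no answerable turns to compare")
    return mean(gaps)
```

A positive gap would suggest that ASR errors, rather than the query itself, are lowering response quality. Turns labeled ambiguous, unsupported, or not-a-request are excluded so they do not dilute the comparison.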
Where Pith is reading between the lines
- This approach to low-cost transcription could support dataset creation for other languages with limited resources.
- Developers might use the evaluation method to rapidly iterate on ASR improvements for voice LLMs.
- The dataset could reveal patterns in how dialects affect interaction success rates.
Load-bearing premise
The multi-ASR agreement post-editing process produces transcripts that are accurate enough to distinguish ASR mistakes from intrinsic query problems.
What would settle it
Independent human transcription of a portion of the data shows significant errors in the gold transcripts, or the LLM judge ratings fail to correlate with human ratings of the responses.
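The second half of this test, checking whether LLM judge ratings track human ratings, could be run with a rank correlation. The sketch below implements Spearman's rho in plain Python with average ranks for ties. Choosing Spearman is an assumption here, since the summary does not name a correlation measure.

```python
def average_ranks(values: list[float]) -> list[float]:
    """Rank values from 1..n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x: list[float], y: list[float]) -> float:
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        raise ValueError("correlation undefined for constant input")
    return cov / (vx * vy) ** 0.5


def spearman(judge: list[float], human: list[float]) -> float:
    """Spearman rank correlation between judge and human ratings."""
    if len(judge) != len(human) or len(judge) < 2:
        raise ValueError("need two equal-length sequences of at least 2 items")
    return pearson(average_ranks(judge), average_ranks(human))
```

A value near zero on a human-rated subset would be evidence against using the judges as a proxy for user satisfaction.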
Original abstract
Large Language Model (LLM) voice assistants are commonly built as cascaded automatic speech recognition (ASR)-to-LLM systems, where recognition errors can distort user intent. Dislikes may also arise from ambiguous, out-of-domain, or non-request turns, making it hard to isolate ASR effects. We release WASIL (the name denotes connection or linking in Arabic): in-the-wild Arabic spoken interaction prompts with audio, ASR hypotheses, assistant responses, and explicit like/dislike feedback (8,529 turns; 14.2% dislikes), plus a 2,000-turn test set covering Modern Standard Arabic (MSA) and four major dialects with their labels. We provide low-cost gold transcripts via multi-ASR agreement-guided post-editing and annotate answerability (answerable, ambiguous/needs-clarification, unsupported, not-a-request/noise) to separate intrinsic unanswerability from ASR-induced degradation. Finally, we describe scalable reference-free evaluation of responses from ASR vs. gold transcripts using multi-judge LLM scoring.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces WASIL, a dataset of 8,529 in-the-wild Arabic spoken interaction turns including audio, ASR hypotheses, LLM assistant responses, and explicit like/dislike feedback, plus a 2,000-turn test set covering MSA and four major dialects with labels. It describes low-cost gold transcript creation via multi-ASR agreement-guided post-editing, answerability annotation (answerable, ambiguous/needs-clarification, unsupported, not-a-request/noise) to separate intrinsic unanswerability from ASR-induced degradation, and a scalable reference-free evaluation using multi-judge LLM scoring of responses from ASR versus gold transcripts.
Significance. If the post-edited transcripts prove reliable, the dataset would provide a useful resource for diagnosing sources of user dissatisfaction in cascaded ASR-LLM voice systems for Arabic, including dialectal varieties. The release of raw audio, feedback labels, and the proposed reference-free evaluation protocol is a concrete contribution that could support follow-on work on ASR robustness. Credit is given for the dataset release and the practical focus on low-cost transcript generation and scalable evaluation.
major comments (1)
- [Abstract and transcript creation description] The multi-ASR agreement-guided post-editing procedure is presented as producing reliable gold transcripts for answerability annotation and reference-free evaluation, yet no WER, edit-distance statistics, or human validation results are reported on any subset of the 8,529 turns or the 2,000-turn test set. This directly affects the central utility claim of separating ASR-induced degradation from intrinsic unanswerability.
minor comments (2)
- [Data collection and annotation] Additional details on data collection biases, inter-annotator agreement for answerability labels, and the precise multi-judge LLM scoring protocol (number of judges, aggregation rule) would improve reproducibility.
- [Evaluation] Clarify the exact composition and usage of the 2,000-turn test set within the evaluation experiments.
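For the inter-annotator agreement requested in the first minor comment, one common statistic for two annotators is Cohen's kappa. The sketch below is a plain-Python illustration and not the authors' method. The statistic the paper actually uses, if any, is not stated in this summary.

```python
from collections import Counter


def cohens_kappa(labels_a: list[str], labels_b: list[str]) -> float:
    """Cohen's kappa for two annotators labeling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each annotator's marginal label frequencies.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1 - expected)
```

Because answerability labels are likely skewed toward "answerable", a chance-corrected measure of this kind is more informative than raw percent agreement.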
Simulated Author's Rebuttal
We thank the referee for the constructive feedback and for recognizing the potential utility of the WASIL dataset, the value of the raw audio and feedback release, and the practical emphasis on low-cost transcript generation and scalable evaluation. We address the single major comment below.
- Referee: [Abstract and transcript creation description] The multi-ASR agreement-guided post-editing procedure is presented as producing reliable gold transcripts for answerability annotation and reference-free evaluation, yet no WER, edit-distance statistics, or human validation results are reported on any subset of the 8,529 turns or the 2,000-turn test set. This directly affects the central utility claim of separating ASR-induced degradation from intrinsic unanswerability.
Authors: We agree that the absence of quantitative validation metrics for the post-edited transcripts weakens the central claim that the procedure reliably separates ASR-induced degradation from intrinsic unanswerability. In the revised manuscript we will add (i) WER and character-level edit-distance statistics comparing the final post-edited transcripts against the original multi-ASR hypotheses on both the full 8,529-turn collection and the 2,000-turn test set, and (ii) human validation results (inter-annotator agreement and error analysis) on a stratified random subset of at least 500 turns. These additions will be placed in a new subsection under “Gold Transcript Creation” and will be referenced from the abstract.
Revision: yes
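The WER and character-level edit-distance statistics promised in this response can be computed with a standard Levenshtein distance. The following Python sketch treats the post-edited transcript as the reference and an ASR hypothesis as the candidate. The function names are illustrative, and any text normalization (diacritics, punctuation) that the authors might apply is deliberately left out.

```python
def edit_distance(ref, hyp) -> int:
    """Levenshtein distance between two sequences (words or characters)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edits divided by reference length."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: character-level edits divided by reference length."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return edit_distance(reference, hypothesis) / len(reference)
```

Note that these figures measure how much the post-editors changed the ASR output. They do not by themselves show that the post-edited text is correct, which is why the human validation in item (ii) is still needed.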
Circularity Check
No circularity: dataset release and procedural description only
The paper releases the WASIL dataset of Arabic spoken interactions with audio, ASR hypotheses, responses, and feedback, then describes a multi-ASR agreement-guided post-editing process for low-cost gold transcripts plus reference-free multi-judge LLM scoring for evaluation. No mathematical derivations, predictions, fitted parameters, or closed-form results are claimed anywhere in the manuscript. All steps are empirical data collection and annotation procedures that stand independently without reducing to self-definitions, self-citations, or renamed inputs by construction.
Axiom & Free-Parameter Ledger
axioms (2)
- Domain assumption: Agreement among multiple ASR systems can guide post-editing to produce usable gold transcripts at low cost.
- Domain assumption: Multi-judge LLM scoring yields reliable reference-free quality estimates for assistant responses.
Reference graph
Works this paper leans on
- [1] Introduction. Large language models (LLMs) are increasingly embedded in everyday applications, supporting both text and speech interaction and enabling open-domain conversational assistants beyond intent–slot pipelines [1, 2]. In many practical systems, speech interaction is implemented as a cascade in which automatic speech recognition (ASR) first c...
- [2] Related Work. 2.1. Interaction Datasets. Large-scale logs of human–assistant interactions have enabled empirical analysis of failure modes and preference learning for text-based assistants. WildChat collects one million real ChatGPT interaction logs [12]. Chatbot Arena provides pairwise human preferences and an Elo-style ranking framework for LLM evaluati...
- [3] Datasets. 3.1. Data Collection. In Figure 1, we present the WASIL dataset development process. For data collection, we recruited 93 users to interact with an Arabic-centric ASR → LLM system. For both tasks, we used the publicly available Fanar APIs [22]. The same user recordings were also processed with an alternative pipeline that uses Gemini [23] for both A...
- [4] Experiments and Results. 4.1. Experimental Setup. We benchmark both open and closed models under multiple query input variations, including (i) transcript using ASR vs. gold transcripts, and (ii) raw audio. For ASR, as noted earlier, we use Fanar Aura and Gemini, since both have shown competitive performance for Arabic in prior work [49]. This setup allow...
- [5] Discussion. 5.1. Effect of Input Modality and Transcript Quality. Table 6 details Gemini’s performance across different input conditions and rubric dimensions. We observe a consistent improvement in overall performance as input quality transitions from direct audio to ASR transcripts, and finally to gold transcripts. When reasoning directly from audio, ...
- [6] Conclusion. In this paper, we introduced WASIL, to our knowledge the first in-the-wild dataset of Arabic spoken interactions with LLMs, designed to capture realistic conversational conditions under dialect variation and speech-driven input noise. The dataset includes post-edited transcriptions, user feedback (like and dislike, with fine-grained categ...
- [7] K. Wang, H. Ren, Z. Lu, M. Zhan, and H. Li, “Voiceassistant-eval: Benchmarking ai assistants across listening, speaking, and viewing,” arXiv preprint arXiv:2509.22651, 2025.
- [8] Y. Hou, H. Liu, Y. Wang, Z. Cheng, R. Wu, Q. Gu, Y. Wang, and Y. Wang, “SOVA-Bench: Benchmarking the Speech Conversation Ability for LLM-based Voice Assistant,” in Interspeech 2025, 2025, pp. 5713–5717.
- [9] J. Billa, “The cascade equivalence hypothesis: When do speech llms behave like asr → llm pipelines?” arXiv preprint arXiv:2602.17598, 2026.
- [10] M. Kubis, P. Skórzewski, M. Sowański, and T. Zietkiewicz, “Back transcription as a method for evaluating robustness of natural language understanding models to speech recognition errors,” in Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing. Singapore: Association for Computational Linguistics, Dec. 2023, pp....
- [11] M. Galbraith, “An analysis of dialogue repair in voice assistants,” arXiv preprint arXiv:2311.03952, 2024.
- [12] H. Men, Y. Hu, Y. He, Y. Gao, X. Mou, and Y. Xu, “Reject or not?: A benchmark for voice assistant query rejection in smart home scenario and an improved method based on llms,” arXiv preprint arXiv:2512.10257, 2025.
- [13] Y. Chen, X. Yue, C. Zhang, X. Gao, R. T. Tan, and H. Li, “VoiceBench: Benchmarking llm-based voice assistants,” Transactions of the Association for Computational Linguistics, vol. 14, pp. 378–398, 2026.
- [14] S. Kim, A. Arora, D. Le, C.-F. Yeh, C. Fuegen, O. Kalinli, and M. L. Seltzer, “Semantic Distance: A New Metric for ASR Performance Analysis Towards Spoken Language Understanding,” in Interspeech 2021, 2021, pp. 1977–1981.
- [15] J. Harvill, R. Khaziev, S. Li, and R. Cogill, “Significant ASR error detection for conversational voice assistants,” in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2024.
- [16] Z. Liu, S. Kim, and O. Kalinli, “Evaluating Speech Recognition Performance Towards Large Language Model Based Voice Assistants,” in Interspeech 2024, 2024, pp. 4099–4103.
- [17] Casablanca: Data and models for multidialectal Arabic speech recognition. B. Talafha, K. Kadaoui, S. M. Magdy, M. Habiboullah, C. M. Chafei, A. O. El-Shangiti, H. Zayed, M. C. Tourad, R. Alhamouri, R. Assi, A. Alraeesi, H. Mohamed, F. Alwajih, A. Mohamed, A. El Mekki, E. M. B. Nagoudi, B. D. M. Saadia, H. A. Alsayadi, W. Al-Dhabyani, S. Shatnawi, Y. Ech-chammakhy, A. Makouar, Y. Berrachedi, M. Jarrar, S. Shehata, I. Berrada, ...
- [18] W. Zhao, X. Ren, J. Hessel, C. Cardie, Y. Choi, and Y. Deng, “WildChat: 1m chatgpt interaction logs in the wild,” in The Twelfth International Conference on Learning Representations, 2024.
- [19] W.-L. Chiang, L. Zheng, Y. Sheng, A. N. Angelopoulos, T. Li, D. Li, B. Zhu, H. Zhang, M. Jordan, J. E. Gonzalez et al., “Chatbot arena: An open platform for evaluating llms by human preference,” in International Conference on Machine Learning. PMLR, 2024, pp. 8359–8388.
- [20] L. Zheng, W.-L. Chiang, Y. Sheng, S. Zhuang, Z. Wu, Y. Zhuang, Z. Lin, Z. Li, D. Li, E. P. Xing, H. Zhang, J. E. Gonzalez, and I. Stoica, “Judging LLM-as-a-judge with MT-bench and chatbot arena,” in Proceedings of the 37th International Conference on Neural Information Processing Systems, 2023, pp. 46595–46623.
- [21] D. Bohus and A. I. Rudnicky, “Sorry, i didn’t catch that! – an investigation of non-understanding errors and recovery strategies,” in Proceedings of SIGDIAL 2005, 2005.
- [22] G. Tur, A. Deoras, and D. Hakkani-Tur, “Detecting out-of-domain utterances addressed to a virtual personal assistant,” in Proceedings of Interspeech 2014, 2014.
- [23] H. A. Rahmani, X. Wang, Y. Feng, Q. Zhang, E. Yilmaz, and A. Lipani, “A survey on asking clarification questions datasets in conversational systems,” in Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), A. Rogers, J. Boyd-Graber, and N. Okazaki, Eds. Toronto, Canada: Association for Computat...
- [24] J. G. Fiscus, “A post-processing system to yield reduced word error rates: Recognizer output voting error reduction (ROVER),” in Proceedings of the 1997 IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU), 1997.
- [25] A. Ali, P. Bell, and S. Renals, “Multi-reference evaluation for dialectal speech recognition system,” in Proceedings of the 4th Workshop on Arabic Natural Language Processing, 2015.
- [26] S. Wray, H. Mubarak, and A. Ali, “Best practices for crowdsourcing dialectal Arabic speech transcription,” in Proceedings of the 4th Workshop on Arabic Natural Language Processing, 2015.
- [27] J. Prakash, B. Kumar, K. Hacioglu, B. Sharma, S. Gopalan, M. Chetlur, S. Venkatesan, and A. Stolcke, “Better pseudo-labeling with multi-asr fusion and error correction by speechllm,” in Interspeech 2025, 2025.
- [28] F. Team, U. Abbas, M. S. Ahmad, F. Alam, E. Altinisik, E. Asgari, Y. Boshmaf, S. Boughorbel, S. Chawla, S. Chowdhury et al., “Fanar: An arabic-centric multimodal generative ai platform,” arXiv:2501.13944, 2025.
- [29] G. Team, R. Anil, S. Borgeaud, J.-B. Alayrac, J. Yu, R. Soricut, J. Schalkwyk, A. M. Dai, A. Hauth, K. Millican et al., “Gemini: a family of highly capable multimodal models,” arXiv preprint arXiv:2312.11805, 2023.
- [30] ALLam: Large language models for arabic and english. M. S. Bari, Y. Alnumay, N. A. Alzahrani, N. M. Alotaibi, H. A. Alyahya, S. AlRashed, F. A. Mirza, S. Z. Alsubaie, H. A. Alahmed, G. Alabduljabbar, R. Alkhathran, Y. Almushayqih, R. Alnajim, S. Alsubaihi, M. A. Mansour, S. A. Hassan, D. M. Alrubaian, A. Alammari, Z. Alawami, A. Al-Thubaity, A. Abdelali, J. Kuriakose, A. Abujabal, N. Al-Twairesh, A. Alo...
- [31] P. Liang, R. Bommasani, T. Lee, D. Tsipras, D. Soylu, M. Yasunaga, Y. Zhang, D. Narayanan, Y. Wu, A. Kumar, B. Newman, B. Yuan, B. Yan, C. Zhang, C. Cosgrove, C. D. Manning, C. Re, D. Acosta-Navas, D. A. Hudson, E. Zelikman et al., “Holistic evaluation of language models,” Transactions on Machine Learning Research, Aug. 2023, accepted by TMLR (OpenReview).
- [32] L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. L. Wainwright, P. Mishkin, C. Zhang, S. Agarwal, K. Slama, A. Ray, J. Schulman, J. Hilton, F. Kelton, L. Miller, M. Simens, A. Askell, P. Welinder, P. F. Christiano, J. Leike, and R. Lowe, “Training language models to follow instructions with human feedback,” in Advances in Neural Information Processing Systems...
- [33] J. Zhou, T. Lu, S. Mishra, S. Brahma, S. Basu, Y. Luan, D. Zhou, and L. Hou, “Instruction-following evaluation for large language models,” arXiv preprint arXiv:2311.07911, 2023.
- [34] S. Lin, J. Hilton, and O. Evans, “TruthfulQA: Measuring how models mimic human falsehoods,” in Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Dublin, Ireland: Association for Computational Linguistics, May 2022, pp. 3214–3252.
- [35] Y. Bai, S. Kadavath, S. Kundu, A. Askell, J. Kernion, A. Jones, A. Chen, A. Goldie, A. Mirhoseini, C. McKinnon, C. Chen, C. Olsson, C. Olah, D. Hernandez, D. Drain, D. Ganguli, D. Li, E. Tran-Johnson, E. Perez et al., “Constitutional AI: Harmlessness from AI feedback,” arXiv preprint arXiv:2212.08073, Dec. 2022.
- [36] J. Cui, W.-L. Chiang, I. Stoica, and C.-J. Hsieh, “OR-bench: An over-refusal benchmark for large language models,” in Proceedings of the 42nd International Conference on Machine Learning, ser. Proceedings of Machine Learning Research, A. Singh, M. Fazel, D. Hsu, S. Lacoste-Julien, F. Berkenkamp, T. Maharaj, K. Wagstaff, and J. Zhu, Eds., vol. 267. PML...
- [37] A. R. Fabbri, W. Kryściński, B. McCann, C. Xiong, R. Socher, and D. Radev, “SummEval: Re-evaluating summarization evaluation,” Transactions of the Association for Computational Linguistics, vol. 9, pp. 391–409, 2021.
- [38] Y. Tao, O. Viberg, R. S. Baker, and R. F. Kizilcec, “Cultural bias and cultural alignment of large language models,” PNAS Nexus, vol. 3, no. 9, p. pgae346, Sep. 2024.
- [39] K. Jones, K. Walker, C. Caruso, and S. Strassel, “MAGLIC the maghrebi language identification corpus,” in Proceedings of the Speaker and Language Recognition Workshop Odyssey 2024, 2024, pp. 86–90.
- [40] I. Hamed, F. Eryani, D. Palfreyman, and N. Habash, “ZAEBUC-Spoken a multilingual multidialectal arabic-english speech corpus,” in Proceedings of LREC-COLING 2024. ELRA Language Resource Association, 2024, pp. 17770–17782.
- [41] R. Artstein and M. Poesio, “Survey article: Inter-coder agreement for computational linguistics,” Computational linguistics, vol. 34, no. 4, pp. 555–596, 2008.
- [42] S. Akasaki and M. Sassano, “Detecting ambiguous utterances in an intelligent assistant,” in Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing Industry Track. Association for Computational Linguistics, 2024, pp. 386–394.
- [43] L.-M. Zhan, H. Liang, B. Liu, L. Fan, X.-M. Wu, and A. Y. S. Lam, “Out-of-scope intent detection with self-supervision and discriminative training,” in Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing. Association for Computational Linguisti...
- [44] H. Lang, Y. Zheng, B. Hui, F. Huang, and Y. Li, “Out-of-domain intent detection considering multi-turn dialogue contexts,” in Proceedings of LREC-COLING 2024. ELRA Language Resource Association, 2024, pp. 12539–12552.
- [45] H. Bunt, V. Petukhova, E. Gilmartin, C. Pelachaud, A. Fang, S. Keizer, and L. Prevot, “The iso standard for dialogue act annotation, second edition,” in Proceedings of the 12th LREC, 2020, pp. 549–558.
- [46] K. L. Gwet, “Computing inter-rater reliability and its variance in the presence of high agreement,” British Journal of Mathematical and Statistical Psychology, vol. 61, no. 1, pp. 29–48, 2008.
- [47] M. Elmahdy, R. Gruhn, W. Minker, and S. Abdennadher, “Cross-lingual acoustic modeling for dialectal Arabic speech recognition,” in Interspeech 2010, 2010, pp. 873–876.
- [48] S. A. Chowdhury, A. Hussein, A. Abdelali, and A. Ali, “Towards One Model to Rule All: Multilingual Strategy for Dialectal Code-Switching Arabic ASR,” in Interspeech 2021, 2021, pp. 2466–2470.
- [49] L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. Wainwright, P. Mishkin, C. Zhang, S. Agarwal, K. Slama, A. Ray, J. Schulman, J. Hilton, F. Kelton, L. Miller, M. Simens, A. Askell, P. Welinder, P. F. Christiano, J. Leike, and R. Lowe, “Training language models to follow instructions with human feedback,” in Advances in Neural Information Processing Systems, v...
- [50] S. Lin, J. Hilton, and O. Evans, “TruthfulQA: Measuring how models mimic human falsehoods,” in Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), S. Muresan, P. Nakov, and A. Villavicencio, Eds. Dublin, Ireland: Association for Computational Linguistics, May 2022, pp. 3214–3252.
- [51] S. Gehman, S. Gururangan, M. Sap, Y. Choi, and N. A. Smith, “RealToxicityPrompts: Evaluating neural toxic degeneration in language models,” in Findings of the Association for Computational Linguistics: EMNLP 2020, T. Cohn, Y. He, and Y. Liu, Eds. Online: Association for Computational Linguistics, Nov. 2020, pp. 3356–3369.
- [52] A. Gatt and E. J. Krahmer, “Survey of the state of the art in natural language generation: Core tasks, applications and evaluation,” Journal of Artificial Intelligence Research, vol. 61, no. 1, pp. 65–170, 2018.
- [53] T. Naous, M. J. Ryan, A. Ritter, and W. Xu, “Having beer after prayer? measuring cultural bias in large language models,” in Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), L.-W. Ku, A. Martins, and V. Srikumar, Eds. Bangkok, Thailand: Association for Computational Linguistics, Aug. 20...
- [54] F. Alwajih, A. El Mekki, H. Mubarak, M. Hawasly, A. Mohamed, and M. Abdul-Mageed, “PalmX 2025: The first shared task on benchmarking LLMs on Arabic and islamic culture,” in Proceedings of The Third Arabic Natural Language Processing Conference: Shared Tasks, K. Darwish, A. Ali, I. Abu Farha, S. Touileb, I. Zitouni, A. Abdelali, S. Al-Ghamdi, S. Alkher...
- [55] F. Alam, M. A. Hasan, and S. A. Chowdhury, “SpokenNativQA: Multilingual everyday spoken queries for llms,” in Proceedings of the 26th Interspeech Conference (Interspeech 2025). Rotterdam, The Netherlands: ISCA, Aug. 2025.
- [56] J. Xu, Z. Guo, J. He, H. Hu, T. He, S. Bai, K. Chen, J. Wang, Y. Fan, K. Dang, B. Zhang, X. Wang, Y. Chu, and J. Lin, “Qwen2.5-omni technical report,” 2025.
- [57] A. Singh, A. Fry, A. Perelman, A. Tart, A. Ganesh, A. El-Kishky, A. McLaughlin, A. Low, A. Ostrow, A. Ananthram et al., “Openai gpt-5 system card,” arXiv preprint arXiv:2601.03267, 2025.
- [58]
- [59] G. Starace, O. Jaffe, D. Sherburn, J. Aung, J. S. Chan, L. Maksin, R. Dias, E. Mays, B. Kinsella, W. Thompson et al., “Paperbench: Evaluating ai’s ability to replicate ai research,” in ICML. PMLR, 2025, pp. 56843–56873.
- [60] X. Guo, U. Tyagi, A. Gosai, P. Vergara, J. Park, E. G. H. Montoya, C. B. C. Zhang, B. Hu, Y. He, B. Liu et al., “Beyond seeing: Evaluating multimodal llms on tool-enabled image perception, transformation, and reasoning,” arXiv preprint arXiv:2510.12712, 2025.
- [61] Appendix. 8.1. PROMPTS. 8.1.1. Judge System Prompt for evaluating Transcription-based queries. You are a STRICT evaluator assessing whether an AI assistant truly understood the user’s intent and produced a high-quality, grounded response. You will receive: - user_query: the user’s original query (may be in Arabic dialect or English). This can be a question,...