arxiv: 2605.17007 · v1 · pith:TW2NMZMHnew · submitted 2026-05-16 · 💻 cs.CL

HalluScore: Large Language Model Hallucination Question Answering Benchmark

Pith reviewed 2026-05-19 20:27 UTC · model grok-4.3

classification 💻 cs.CL
keywords: llms, arabic, hallucination, halluscore, language, reasoning, benchmark, benchmarks

The pith

HalluScore is a curated Arabic QA dataset with 827 questions, ground-truth evidence, and human annotations used to measure hallucination rates across 17 LLMs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Large language models sometimes make up facts instead of sticking to what they know. This is called hallucination. Most existing tests for this problem are in English or Chinese, leaving Arabic with few good ways to check. The authors built HalluScore by collecting questions across different topics, time periods, and cultural settings in Arabic. They filtered the questions so that models are likely to hallucinate on them, added verified answers, and had people label the model outputs as fully hallucinated, partially hallucinated, or correct. They then ran 17 different models on the benchmark and recorded the patterns.
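
As a rough illustration of how three-way labels turn into per-model rates, the sketch below tallies annotations for each model. The label strings, the model names, and the decision to report each label as a separate share are assumptions made for illustration; the paper's own scoring scheme may differ.

```python
from collections import Counter

# Hypothetical label names; the released annotation files may use other strings.
LABELS = ("hallucinated", "partial", "correct")

def hallucination_rates(annotations):
    """Map each model to the share of its responses carrying each label.

    `annotations` is an iterable of (model, label) pairs.
    """
    counts = {}
    for model, label in annotations:
        if label not in LABELS:
            raise ValueError(f"unknown label: {label}")
        counts.setdefault(model, Counter())[label] += 1
    rates = {}
    for model, c in counts.items():
        total = sum(c.values())
        rates[model] = {label: c[label] / total for label in LABELS}
    return rates

# Toy data, not results from the paper.
toy_annotations = [
    ("model_a", "hallucinated"), ("model_a", "correct"),
    ("model_a", "partial"), ("model_a", "correct"),
    ("model_b", "hallucinated"), ("model_b", "hallucinated"),
]
rates = hallucination_rates(toy_annotations)
print(rates["model_a"])
```

Keeping partial hallucination as its own share, rather than folding it into either neighbor, preserves the distinction the annotators were asked to make.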

Core claim

We introduce HalluScore, a structured Arabic question answering benchmark designed to evaluate hallucination behavior in LLMs across different levels of reasoning difficulty, various knowledge domains, historical timelines, and culturally grounded Arabic scenarios. It contains 827 carefully curated questions.

Load-bearing premise

The model-driven selection process successfully retains only questions that consistently trigger hallucinations while preserving factual validity and cultural grounding; this premise is stated in the abstract's description of the construction pipeline but lacks independent verification details.
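
To make the premise concrete, here is a minimal sketch of what a "consistently triggers hallucinations" filter could look like. The abstract does not disclose the actual rule, so the function name, the per-model rate threshold `min_rate`, and the model-count threshold `min_models` are all hypothetical.

```python
def select_questions(run_labels, min_rate=0.5, min_models=2):
    """Keep question ids that trigger hallucination 'consistently'.

    run_labels: {question_id: {model: [bool, ...]}}, where each bool marks
    whether one independent run of that model hallucinated.
    A question is kept when at least `min_models` models hallucinate in at
    least `min_rate` of their runs. Both thresholds are hypothetical.
    """
    kept = []
    for qid, per_model in run_labels.items():
        consistent = sum(
            1 for runs in per_model.values()
            if runs and sum(runs) / len(runs) >= min_rate
        )
        if consistent >= min_models:
            kept.append(qid)
    return kept

# Toy data, not taken from the paper.
toy_runs = {
    "q1": {"m1": [True, True, False], "m2": [True, True, True]},
    "q2": {"m1": [False, False, False], "m2": [True, False, False]},
    "q3": {"m1": [True, False], "m2": [False, False]},
}
selected = select_questions(toy_runs)
print(selected)
```

Under any rule of this shape, the retained set depends on which models do the filtering, which is why it matters whether those models overlap with the ones later evaluated.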

Figures

Figures reproduced from arXiv: 2605.17007 by Aisha Alansari, Hamzah Luqman.

Figure 1: The pipeline of HalluScore dataset construction and benchmarking.
    Adjacent paper text: "This process resulted in an initial pool of 1,500 QA pairs, each with a verified source link to support the ground-truth answer. The ground-truth links were obtained from reliable sources, such as Wikipedia and official government websites."
Figure 2: Prompt provided to Gemini-2.5-Flash to generate explanations of answers in …
Figure 3: Type and domain distribution across the HalluScore dataset. (a) The type distribution across the questions. (b) The knowledge domain proportion across the questions.
    Adjacent paper text (fragment): "… interactions between the four binary attributes: adversarial intent, reasoning requirement, historical dependency, and Arab cultural relevance."
Figure 4: Normalized knowledge domain distribution across types.
Figure 5: Co-occurrence matrix of binary labels in …
Figure 6: Stacked distribution of factual hallucination percentages per model across question types. …
Figure 7: Heatmap of pairwise Jaccard similarity between models based on factual hallucination instances.
Figure 8: Examples of partial hallucination responses generated by some LLMs on …
Figure 9: Example of response style differences between Claude Sonnet and DeepSeek-R1 on a reasoning …
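
The Figure 7 caption mentions pairwise Jaccard similarity between models based on their factual hallucination instances. A minimal sketch of that computation follows, assuming each model is represented by the set of question ids on which it hallucinated; the model names and ids are toy values, and treating two empty sets as similarity 0.0 is a convention chosen here, not one stated in the paper.

```python
from itertools import combinations

def jaccard(a, b):
    """Jaccard similarity |A ∩ B| / |A ∪ B|; defined here as 0.0 when both are empty."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def pairwise_jaccard(hallucinated_by_model):
    """Return {(model_1, model_2): similarity} for every unordered model pair."""
    return {
        (m1, m2): jaccard(hallucinated_by_model[m1], hallucinated_by_model[m2])
        for m1, m2 in combinations(sorted(hallucinated_by_model), 2)
    }

# Toy sets of question ids on which each model hallucinated.
toy_sets = {
    "model_a": {1, 2, 3, 4},
    "model_b": {3, 4, 5},
    "model_c": set(),
}
sims = pairwise_jaccard(toy_sets)
print(sims[("model_a", "model_b")])
```

A high value for a pair means the two models tend to fail on the same questions, which is what a heatmap of these values makes visible.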
original abstract

Large language models (LLMs) have achieved remarkable progress in natural language generation, but remain susceptible to hallucination. In response to growing concerns about hallucinations, several benchmarks have been developed, primarily in English and Chinese. However, Arabic remains underrepresented, with limited benchmarks for LLMs hallucination due to scarce annotated resources and the language's morphological complexity. Consequently, existing benchmarks do not adequately reflect the linguistic, cultural, and reasoning characteristics of Arabic. To address this gap, we introduce HalluScore, a structured Arabic question answering benchmark designed to evaluate hallucination behavior in LLMs across different levels of reasoning difficulty, various knowledge domains, historical timelines, and culturally grounded Arabic scenarios. It contains 827 carefully curated questions for evaluating, detecting, and mitigating hallucination in LLMs. The dataset was constructed through a structured pipeline involving quality assurance, filtering for clarity and factual validity, and model-driven selection to retain questions that consistently trigger hallucinations. Each question is linked to verified ground-truth evidence, answer explanations, and multi-label annotations. Using the HalluScore benchmark, we conduct a comprehensive empirical analysis of hallucination patterns across 17 Arabic, multilingual, and reasoning LLMs. Moreover, we provide high-quality human annotations identifying hallucinated, non-hallucinated, and partially hallucinated responses of all evaluated LLMs. These results suggest that hallucination in Arabic LLMs extends beyond factual inaccuracies, encompassing challenges related to cultural understanding, linguistic reasoning, and logical consistency. We release HalluScore to support future research on improving the reliability and cultural competence of LLMs in Arabic.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces HalluScore, an Arabic QA benchmark with 827 curated questions for evaluating LLM hallucinations across reasoning difficulty levels, knowledge domains, historical timelines, and culturally grounded scenarios. The dataset is built via a pipeline of quality assurance, factual filtering, and model-driven selection to retain questions that consistently trigger hallucinations; each question includes verified ground-truth evidence and multi-label annotations. The authors evaluate hallucination patterns in 17 Arabic, multilingual, and reasoning LLMs and supply human annotations distinguishing hallucinated, non-hallucinated, and partially hallucinated outputs.

Significance. If the curation pipeline proves robust and transparent, HalluScore would address a genuine gap by supplying the first large-scale, culturally attuned Arabic hallucination benchmark, complementing English- and Chinese-centric resources. The release of the dataset together with human annotations and the empirical analysis across 17 models are concrete strengths that could support future detection and mitigation work. Significance is currently limited by the absence of quantitative validation for the selection and annotation steps.

major comments (2)
  1. [Abstract / Methods] Abstract and construction pipeline: the model-driven selection step that retains questions 'that consistently trigger hallucinations' is load-bearing for the central claim that the 827 questions reliably diagnose Arabic hallucination phenomena. No concrete metrics (hallucination-rate threshold, number of independent runs, agreement criterion) or disclosure of the selection models (and whether they are disjoint from the 17 evaluated LLMs) are provided. This leaves open the possibility that retained questions are merely difficult for the curation models rather than generally diagnostic.
  2. [Human Annotations] Human annotation section: the manuscript claims 'high-quality human annotations' identifying hallucinated, non-hallucinated, and partially hallucinated responses, yet reports no inter-annotator agreement statistics (e.g., Cohen’s or Fleiss’ kappa), number of annotators, or resolution procedure. These details are required to substantiate the downstream claims that hallucination in Arabic LLMs extends to cultural understanding and logical consistency.
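
For reference, the kind of agreement statistic requested in the second major comment can be computed as in the sketch below, which implements Cohen's kappa for two annotators. The label strings and the toy annotations are assumptions; the paper reports neither the number of annotators nor their labels.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(labels_a)
    # Observed agreement: share of items with identical labels.
    observed = sum(x == y for x, y in zip(labels_a, labels_b)) / n
    # Chance agreement from each annotator's own label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1 - expected)

# Toy annotations: H = hallucinated, P = partial, C = correct.
annotator_1 = ["H", "H", "C", "C", "P", "C"]
annotator_2 = ["H", "P", "C", "C", "P", "C"]
kappa = cohens_kappa(annotator_1, annotator_2)
print(round(kappa, 4))
```

Kappa corrects raw agreement for the agreement expected by chance, which matters when one label (for example, "correct") dominates the annotations.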
minor comments (2)
  1. [Abstract] The abstract lists coverage of 'different levels of reasoning difficulty, various knowledge domains, historical timelines' but supplies no distribution statistics or table showing how the 827 questions are allocated across these axes; adding such a breakdown would clarify benchmark balance.
  2. Consider adding a short table or appendix entry that lists the exact filtering criteria and quality-assurance checks applied before the model-driven selection step.
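
One way to produce the breakdown suggested in the first minor comment is a co-occurrence count over the four binary attributes named in the Figure 3 excerpt (adversarial intent, reasoning requirement, historical dependency, Arab cultural relevance). The field names and the toy records below are hypothetical; only the attribute list comes from the page.

```python
# Hypothetical field names for the four binary attributes.
ATTRS = ("adversarial", "reasoning", "historical", "arab_cultural")

def cooccurrence(questions):
    """Count, for every attribute pair, how many questions carry both.

    The diagonal holds the number of questions carrying each single attribute.
    """
    matrix = {a: {b: 0 for b in ATTRS} for a in ATTRS}
    for q in questions:
        active = [a for a in ATTRS if q.get(a)]
        for a in active:
            for b in active:
                matrix[a][b] += 1
    return matrix

# Toy records, not drawn from the dataset.
toy_questions = [
    {"adversarial": True, "reasoning": True},
    {"reasoning": True, "historical": True, "arab_cultural": True},
    {"arab_cultural": True},
]
matrix = cooccurrence(toy_questions)
for a in ATTRS:
    print(a, [matrix[a][b] for b in ATTRS])
```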

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript introducing HalluScore. We address each major comment below and outline the revisions we will make to improve clarity and transparency.

point-by-point responses
  1. Referee: [Abstract / Methods] Abstract and construction pipeline: the model-driven selection step that retains questions 'that consistently trigger hallucinations' is load-bearing for the central claim that the 827 questions reliably diagnose Arabic hallucination phenomena. No concrete metrics (hallucination-rate threshold, number of independent runs, agreement criterion) or disclosure of the selection models (and whether they are disjoint from the 17 evaluated LLMs) are provided. This leaves open the possibility that retained questions are merely difficult for the curation models rather than generally diagnostic.

    Authors: We appreciate the referee's point regarding the need for more transparency in the model-driven selection process. The current manuscript outlines the pipeline but does not include the specific metrics or model details mentioned. We will revise the Methods section to include the hallucination-rate threshold, number of independent runs, agreement criterion, and explicitly state the selection models used along with their disjointness from the 17 evaluated models. This addition will strengthen the claim that the questions are generally diagnostic of hallucination phenomena.

    Revision: yes

  2. Referee: [Human Annotations] Human annotation section: the manuscript claims 'high-quality human annotations' identifying hallucinated, non-hallucinated, and partially hallucinated responses, yet reports no inter-annotator agreement statistics (e.g., Cohen’s or Fleiss’ kappa), number of annotators, or resolution procedure. These details are required to substantiate the downstream claims that hallucination in Arabic LLMs extends to cultural understanding and logical consistency.

    Authors: We acknowledge that the manuscript does not report inter-annotator agreement statistics, the number of annotators, or the resolution procedure. These details are indeed important to substantiate the quality of the annotations. In the revised version, we will include these quantitative measures, such as Fleiss' kappa, the number of annotators involved, and how disagreements were resolved, to support the claims about hallucination extending to cultural understanding and logical consistency.

    Revision: yes
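
Since the simulated response names Fleiss' kappa, a minimal sketch of that statistic is given here for the case of more than two annotators. The input layout (one row of per-category counts per item), the three-category setup, and the toy counts are assumptions for illustration.

```python
def fleiss_kappa(counts):
    """Fleiss' kappa; counts[i][j] = annotators assigning item i to category j.

    Every row must sum to the same number of annotators (at least 2).
    """
    if not counts:
        raise ValueError("need at least one item")
    n_items = len(counts)
    n_raters = sum(counts[0])
    if n_raters < 2 or any(sum(row) != n_raters for row in counts):
        raise ValueError("each item needs the same number (>= 2) of ratings")
    # Per-item agreement: fraction of annotator pairs that agree.
    per_item = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_bar = sum(per_item) / n_items
    # Chance agreement from the overall category proportions.
    n_cats = len(counts[0])
    props = [
        sum(row[j] for row in counts) / (n_items * n_raters)
        for j in range(n_cats)
    ]
    p_e = sum(p * p for p in props)
    if p_e == 1.0:
        return 1.0  # every rating fell into a single category
    return (p_bar - p_e) / (1 - p_e)

# Toy counts: 4 items, 3 annotators, columns = (hallucinated, partial, correct).
toy_counts = [[3, 0, 0], [0, 3, 0], [2, 1, 0], [0, 1, 2]]
fk = fleiss_kappa(toy_counts)
print(round(fk, 4))
```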

Circularity Check

0 steps flagged

Empirical benchmark curation shows no definitional or fitted circularity

full rationale

The paper presents an empirical dataset construction pipeline for an Arabic hallucination QA benchmark. The central contribution is the 827-question collection itself, built via quality assurance, factual filtering, and model-driven selection. No mathematical derivations, equations, or first-principles claims exist that reduce to their own inputs. The model-driven selection step is described at a high level but does not equate to a 'prediction' or self-definition by construction; it is a filtering heuristic whose details would require external verification rather than internal reduction. Any self-citations are incidental and non-load-bearing for the benchmark's existence or utility. The work is self-contained as a resource contribution with downstream empirical analysis.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no explicit free parameters, axioms, or invented entities are described. The work relies on standard assumptions about human annotation reliability and question curation validity.

pith-pipeline@v0.9.0 · 5817 in / 1069 out tokens · 34996 ms · 2026-05-19T20:27:23.924146+00:00 · methodology
