HalluScore: Large Language Model Hallucination Question Answering Benchmark
Pith reviewed 2026-05-19 20:27 UTC · model grok-4.3
The pith
HalluScore is a curated Arabic QA dataset with 827 questions, ground-truth evidence, and human annotations used to measure hallucination rates across 17 LLMs.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We introduce HalluScore, a structured Arabic question answering benchmark designed to evaluate hallucination behavior in LLMs across different levels of reasoning difficulty, various knowledge domains, historical timelines, and culturally grounded Arabic scenarios. It contains 827 carefully curated questions.
Load-bearing premise
The model-driven selection process successfully retains only questions that consistently trigger hallucinations while preserving factual validity and cultural grounding; this premise is stated in the abstract's description of the construction pipeline but lacks independent verification details.
Original abstract
Large language models (LLMs) have achieved remarkable progress in natural language generation, but remain susceptible to hallucination. In response to growing concerns about hallucinations, several benchmarks have been developed, primarily in English and Chinese. However, Arabic remains underrepresented, with limited benchmarks for LLM hallucination due to scarce annotated resources and the language's morphological complexity. Consequently, existing benchmarks do not adequately reflect the linguistic, cultural, and reasoning characteristics of Arabic. To address this gap, we introduce HalluScore, a structured Arabic question answering benchmark designed to evaluate hallucination behavior in LLMs across different levels of reasoning difficulty, various knowledge domains, historical timelines, and culturally grounded Arabic scenarios. It contains 827 carefully curated questions for evaluating, detecting, and mitigating hallucination in LLMs. The dataset was constructed through a structured pipeline involving quality assurance, filtering for clarity and factual validity, and model-driven selection to retain questions that consistently trigger hallucinations. Each question is linked to verified ground-truth evidence, answer explanations, and multi-label annotations. Using the HalluScore benchmark, we conduct a comprehensive empirical analysis of hallucination patterns across 17 Arabic, multilingual, and reasoning LLMs. Moreover, we provide high-quality human annotations identifying hallucinated, non-hallucinated, and partially hallucinated responses of all evaluated LLMs. These results suggest that hallucination in Arabic LLMs extends beyond factual inaccuracies, encompassing challenges related to cultural understanding, linguistic reasoning, and logical consistency. We release HalluScore to support future research on improving the reliability and cultural competence of LLMs in Arabic.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces HalluScore, an Arabic QA benchmark with 827 curated questions for evaluating LLM hallucinations across reasoning difficulty levels, knowledge domains, historical timelines, and culturally grounded scenarios. The dataset is built via a pipeline of quality assurance, factual filtering, and model-driven selection to retain questions that consistently trigger hallucinations; each question includes verified ground-truth evidence and multi-label annotations. The authors evaluate hallucination patterns in 17 Arabic, multilingual, and reasoning LLMs and supply human annotations distinguishing hallucinated, non-hallucinated, and partially hallucinated outputs.
Significance. If the curation pipeline proves robust and transparent, HalluScore would address a genuine gap by supplying the first large-scale, culturally attuned Arabic hallucination benchmark, complementing English- and Chinese-centric resources. The release of the dataset together with human annotations and the empirical analysis across 17 models are concrete strengths that could support future detection and mitigation work. Significance is currently limited by the absence of quantitative validation for the selection and annotation steps.
major comments (2)
- [Abstract / Methods] Abstract and construction pipeline: the model-driven selection step that retains questions 'that consistently trigger hallucinations' is load-bearing for the central claim that the 827 questions reliably diagnose Arabic hallucination phenomena. No concrete metrics (hallucination-rate threshold, number of independent runs, agreement criterion) or disclosure of the selection models (and whether they are disjoint from the 17 evaluated LLMs) are provided. This leaves open the possibility that retained questions are merely difficult for the curation models rather than generally diagnostic.
- [Human Annotations] Human annotation section: the manuscript claims 'high-quality human annotations' identifying hallucinated, non-hallucinated, and partially hallucinated responses, yet reports no inter-annotator agreement statistics (e.g., Cohen’s or Fleiss’ kappa), number of annotators, or resolution procedure. These details are required to substantiate the downstream claims that hallucination in Arabic LLMs extends to cultural understanding and logical consistency.
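To make the first major comment concrete, a selection rule of the kind the referee asks the authors to disclose could be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's pipeline: the function name `select_questions`, the 0.6 rate threshold, the two-model agreement criterion, and the label layout are all hypothetical, since the manuscript reports none of them.

```python
from typing import Dict, List


def select_questions(
    run_labels: Dict[str, Dict[str, List[bool]]],
    rate_threshold: float = 0.6,  # hypothetical value, not from the paper
    min_models: int = 2,          # hypothetical agreement criterion
) -> List[str]:
    """Keep questions that trigger hallucinations consistently.

    run_labels[question_id][curation_model] is a list with one boolean per
    independent run; True means the response was judged hallucinated.
    """
    kept: List[str] = []
    for question_id, per_model in run_labels.items():
        models_triggered = 0
        for runs in per_model.values():
            # Per-model hallucination rate over the independent runs.
            if runs and sum(runs) / len(runs) >= rate_threshold:
                models_triggered += 1
        # Require several curation models to agree, so a question is not
        # kept merely because one model finds it hard.
        if models_triggered >= min_models:
            kept.append(question_id)
    return kept
```

Reporting the chosen values of such parameters, and confirming that the curation models do not overlap with the 17 evaluated LLMs, would let readers judge whether the retained questions are generally diagnostic.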
minor comments (2)
- [Abstract] The abstract lists coverage of 'different levels of reasoning difficulty, various knowledge domains, historical timelines' but supplies no distribution statistics or table showing how the 827 questions are allocated across these axes; adding such a breakdown would clarify benchmark balance.
- Consider adding a short table or appendix entry that lists the exact filtering criteria and quality-assurance checks applied before the model-driven selection step.
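The breakdown requested in the first minor comment amounts to a simple tally per annotation axis. A minimal sketch, assuming each question is a record carrying exactly one label per axis (the abstract mentions multi-label annotations, so the released format may differ); the field names `difficulty` and `domain` are hypothetical and not taken from the dataset.

```python
from collections import Counter
from typing import Dict, Iterable, List, Mapping


def axis_breakdown(
    questions: Iterable[Mapping[str, str]],
    axes: List[str],
) -> Dict[str, Counter]:
    """Count how many questions fall under each label of each axis."""
    records = list(questions)  # materialize once so every axis sees all rows
    return {axis: Counter(record[axis] for record in records) for axis in axes}
```

Each resulting `Counter` maps a label to its question count, which is the content a distribution table would report for that axis.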
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript introducing HalluScore. We address each major comment below and outline the revisions we will make to improve clarity and transparency.
- Referee: [Abstract / Methods] Abstract and construction pipeline: the model-driven selection step that retains questions 'that consistently trigger hallucinations' is load-bearing for the central claim that the 827 questions reliably diagnose Arabic hallucination phenomena. No concrete metrics (hallucination-rate threshold, number of independent runs, agreement criterion) or disclosure of the selection models (and whether they are disjoint from the 17 evaluated LLMs) are provided. This leaves open the possibility that retained questions are merely difficult for the curation models rather than generally diagnostic.
  Authors: We appreciate the referee's point regarding the need for more transparency in the model-driven selection process. The current manuscript outlines the pipeline but does not include the specific metrics or model details mentioned. We will revise the Methods section to report the hallucination-rate threshold, the number of independent runs, and the agreement criterion, and to state explicitly which selection models were used and whether they are disjoint from the 17 evaluated models. This addition will strengthen the claim that the questions are generally diagnostic of hallucination phenomena.
  Revision: yes
- Referee: [Human Annotations] Human annotation section: the manuscript claims 'high-quality human annotations' identifying hallucinated, non-hallucinated, and partially hallucinated responses, yet reports no inter-annotator agreement statistics (e.g., Cohen’s or Fleiss’ kappa), number of annotators, or resolution procedure. These details are required to substantiate the downstream claims that hallucination in Arabic LLMs extends to cultural understanding and logical consistency.
  Authors: We acknowledge that the manuscript does not report inter-annotator agreement statistics, the number of annotators, or the resolution procedure. These details are indeed important to substantiate the quality of the annotations. In the revised version, we will report an agreement statistic such as Fleiss' kappa, the number of annotators involved, and how disagreements were resolved, to support the claims about hallucination extending to cultural understanding and logical consistency.
  Revision: yes
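For reference, Fleiss' kappa, the statistic named in the response above, can be computed from a table of per-response label counts. A minimal sketch, assuming every response is labeled by the same number of annotators using the three categories from the abstract (hallucinated, non-hallucinated, partially hallucinated); the function name and input layout are illustrative assumptions, and no agreement values from the paper are implied.

```python
import math
from typing import List


def fleiss_kappa(counts: List[List[int]]) -> float:
    """Fleiss' kappa for a table of rating counts.

    counts[i][j] is the number of annotators who assigned response i to
    category j. Every row must sum to the same number of annotators.
    """
    if not counts:
        raise ValueError("at least one rated item is required")
    n_items = len(counts)
    n_raters = sum(counts[0])
    if n_raters < 2:
        raise ValueError("at least two annotators per item are required")
    if any(sum(row) != n_raters for row in counts):
        raise ValueError("every item needs the same number of annotators")
    n_categories = len(counts[0])

    # Observed agreement: share of agreeing annotator pairs, averaged over items.
    per_item = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    observed = sum(per_item) / n_items

    # Chance agreement from the overall proportion of each category.
    proportions = [
        sum(row[j] for row in counts) / (n_items * n_raters)
        for j in range(n_categories)
    ]
    expected = sum(p * p for p in proportions)

    if expected == 1.0:
        # Only one category was ever used; kappa is undefined in that case.
        return math.nan
    return (observed - expected) / (1.0 - expected)
```

A value of 1 indicates perfect agreement and values at or below 0 indicate agreement no better than chance. Cohen's kappa, which the referee also mentions, is the usual choice when there are exactly two annotators.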
Circularity Check
Empirical benchmark curation shows no definitional or fitted circularity
The paper presents an empirical dataset construction pipeline for an Arabic hallucination QA benchmark. The central contribution is the 827-question collection itself, built via quality assurance, factual filtering, and model-driven selection. No mathematical derivations, equations, or first-principles claims exist that reduce to their own inputs. The model-driven selection step is described at a high level but does not equate to a 'prediction' or self-definition by construction; it is a filtering heuristic whose details would require external verification rather than internal reduction. Any self-citations are incidental and non-load-bearing for the benchmark's existence or utility. The work is self-contained as a resource contribution with downstream empirical analysis.