OpenHalDet: A Unified Benchmark for Hallucination Detection across Diverse Generation Scenarios

Changdae Oh; Hanchen Wang; Hongnan Ma; Jinyuan Luo; Ling Chen; Mengyue Yang; Sean Du; Shanshan Ye; Sharon Li; Shu-Lin Chen

arxiv: 2606.06959 · v1 · pith:W7CQJ3EWnew · submitted 2026-06-05 · 💻 cs.CL · cs.AI

Xinyi Li, Zhen Fang, Yongxin Deng, Jinyuan Luo, Hongnan Ma, Changdae Oh, Zijing Shi, Shanshan Ye, Hanchen Wang, Shu-Lin Chen, Yadan Luo, Mengyue Yang, Sean Du, Sharon Li, Ling Chen

Pith reviewed 2026-06-27 21:49 UTC · model grok-4.3

keywords hallucination detection · benchmark · large language models · evaluation · black-box methods · gray-box methods · white-box methods

The pith

OpenHalDet standardizes the full evaluation pipeline for hallucination detectors in LLMs to allow direct comparisons across tasks and access levels.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces OpenHalDet as a unified benchmark that standardizes every step of hallucination detection evaluation, from constructing prompts and generating responses to annotating truthfulness, scoring detectors, and computing metrics. It tackles inconsistent setups and narrow task coverage in prior work by supporting a wide range of tasks, models, and detectors in one shared framework. The benchmark explicitly handles black-box methods using only outputs, gray-box methods using probability signals, and white-box methods using internal signals. A reader would care because consistent evaluation makes it possible to identify which detection approaches actually work reliably when deploying LLMs.

Core claim

OpenHalDet is a unified benchmark that standardizes the evaluation pipeline from prompt construction and response generation to truthfulness annotation, detector scoring, and metric computation. It supports heterogeneous detector families under black-box, gray-box, and white-box access settings. By bringing diverse tasks, models, and detectors into one framework, it enables controlled comparison and provides a systematic view of how different detection paradigms behave in LLM applications. The code and datasets are released as an open and extensible codebase.
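
The five stages and three access levels named above can be sketched as follows. This is a minimal illustration, not the API of the released codebase: every name here (`Access`, `Generation`, `Detector`, `visible_view`, `run_pipeline`, and the toy helpers) is hypothetical, and the canned answers and log-probabilities are invented for the demo.

```python
import math
from dataclasses import dataclass
from enum import Enum
from typing import Callable, List, Optional


class Access(Enum):
    BLACK = "black-box"  # generated text only
    GRAY = "gray-box"    # text plus token probabilities
    WHITE = "white-box"  # text, probabilities, and internal signals


@dataclass
class Generation:
    prompt: str
    text: str
    token_logprobs: Optional[List[float]] = None       # gray-box signal
    hidden_states: Optional[List[List[float]]] = None  # white-box signal


@dataclass
class Detector:
    name: str
    access: Access
    score: Callable[[Generation], float]  # higher = more likely hallucinated


def visible_view(gen: Generation, access: Access) -> Generation:
    """Hide every signal the detector's access level does not permit."""
    return Generation(
        prompt=gen.prompt,
        text=gen.text,
        token_logprobs=gen.token_logprobs if access in (Access.GRAY, Access.WHITE) else None,
        hidden_states=gen.hidden_states if access is Access.WHITE else None,
    )


def run_pipeline(questions, build_prompt, generate, annotate, detectors, metric):
    prompts = [build_prompt(q) for q in questions]                # 1. prompt construction
    gens = [generate(p) for p in prompts]                         # 2. response generation
    labels = [annotate(q, g) for q, g in zip(questions, gens)]    # 3. truthfulness annotation
    results = {}
    for det in detectors:
        scores = [det.score(visible_view(g, det.access)) for g in gens]  # 4. detector scoring
        results[det.name] = metric(scores, labels)                # 5. metric computation
    return results


def toy_generate(prompt: str) -> Generation:
    # Stand-in for an LLM call: canned answers with made-up log-probabilities.
    if "France" in prompt:
        return Generation(prompt, "Paris", token_logprobs=[-0.05])
    return Generation(prompt, "maybe Atlantis", token_logprobs=[-1.6, -2.3])


def toy_annotate(question: str, gen: Generation) -> int:
    return 0 if gen.text == "Paris" else 1  # 1 = hallucinated


def threshold_accuracy(scores, labels) -> float:
    # Fraction of examples where thresholding the score at 0.5 matches the label.
    hits = sum((s > 0.5) == bool(y) for s, y in zip(scores, labels))
    return hits / len(labels)


detectors = [
    Detector("hedge-word", Access.BLACK, lambda g: 1.0 if "maybe" in g.text else 0.0),
    Detector("mean-logprob", Access.GRAY,
             lambda g: 1.0 - math.exp(sum(g.token_logprobs) / len(g.token_logprobs))),
]

results = run_pipeline(
    ["Capital of France?", "Capital of Mu?"],
    build_prompt=lambda q: f"Q: {q}\nA:",
    generate=toy_generate,
    annotate=toy_annotate,
    detectors=detectors,
    metric=threshold_accuracy,
)
print(results)
```

The design point is that all detectors share the same prompts, generations, and labels, and differ only in which fields `visible_view` lets them see.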

What carries the argument

The OpenHalDet benchmark, a standardized pipeline that unifies prompt construction, generation, annotation, scoring, and metrics while supporting black-box, gray-box, and white-box detectors.

If this is right

Detector performances reported under the benchmark become directly comparable across different studies and settings.
A systematic view emerges of how black-box, gray-box, and white-box paradigms behave across LLM applications.
Reproducible evaluation becomes possible through the shared open codebase.
New detection methods can be developed and tested under the same standardized conditions.
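
One way to make the "directly comparable" point concrete is to score every detector against the same labels with the same metric. The sketch below assumes AUROC, a common choice for score-based detectors; the text above does not say which metrics OpenHalDet reports, and the scores and labels are invented for illustration.

```python
def auroc(scores, labels):
    """Probability that a random hallucinated example (label 1) receives a
    higher score than a random truthful one (label 0); ties count as half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one example of each class")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical detector scores (higher = more likely hallucinated) and labels.
scores = [0.9, 0.8, 0.35, 0.6, 0.2]
labels = [1, 1, 1, 0, 0]
value = auroc(scores, labels)
print(round(value, 3))
```

Because the metric depends only on the ordering of scores, it can be applied unchanged to black-box, gray-box, and white-box detectors whose raw scores live on different scales.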

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The benchmark could reveal which access level (black-box versus white-box) delivers the best trade-off for particular downstream domains.
Extending the covered tasks to include more recent model families would test whether the standardization holds as LLMs evolve.
Applying the pipeline inside production systems might expose practical gaps between benchmark scores and real deployment reliability.

Load-bearing premise

That making inference configurations and evaluation steps consistent across studies will make detector performance directly comparable and generalizable beyond the tested settings.

What would settle it

Running multiple detectors through the OpenHalDet pipeline on the same tasks and finding that their relative performance rankings still flip when only the random seed or minor prompt wording changes.
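
The check described above can be sketched as a comparison of detector rankings between two runs that differ only in seed or prompt wording. The detector names and metric values below are placeholders invented for illustration, not results from the paper, and `ranking` and `pairwise_flips` are hypothetical helpers.

```python
from itertools import combinations


def ranking(run: dict) -> list:
    """Detector names sorted from best to worst metric value."""
    return sorted(run, key=run.get, reverse=True)


def pairwise_flips(run_a: dict, run_b: dict) -> list:
    """Detector pairs whose relative order differs between two runs."""
    flips = []
    for d1, d2 in combinations(sorted(run_a), 2):
        diff_a = run_a[d1] - run_a[d2]
        diff_b = run_b[d1] - run_b[d2]
        if diff_a * diff_b < 0:  # opposite signs: the pair swapped places
            flips.append((d1, d2))
    return flips


# Placeholder metric values for three detectors under two seeds.
seed_0 = {"det_black": 0.71, "det_gray": 0.78, "det_white": 0.83}
seed_1 = {"det_black": 0.72, "det_gray": 0.81, "det_white": 0.79}

flips = pairwise_flips(seed_0, seed_1)
print(ranking(seed_0), ranking(seed_1), flips)
```

A non-empty `flips` list across many seed pairs would be the kind of evidence that standardizing the pipeline does not by itself stabilize comparisons.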

Figures

Figures reproduced from arXiv: 2606.06959 by Changdae Oh, Hanchen Wang, Hongnan Ma, Jinyuan Luo, Ling Chen, Mengyue Yang, Sean Du, Shanshan Ye, Sharon Li, Shu-Lin Chen, Xinyi Li, Yadan Luo, Yongxin Deng, Zhen Fang, Zijing Shi.

Figure 2: Accuracy–cost trade-offs on Llama-3.2-3B-Instruct across representative scenarios.

Original abstract

Hallucination detection is essential for the reliable deployment of large language models (LLMs). However, existing evaluations face two core challenges: inconsistent inference configuration and evaluation, and limited coverage of downstream domains and tasks. Consequently, reported detector performance is often difficult to compare, reproduce, and generalize beyond specific experimental settings. We introduce OpenHalDet, a unified benchmark for hallucination detection across diverse generation scenarios. OpenHalDet standardizes the evaluation pipeline, from prompt construction and response generation to truthfulness annotation, detector scoring, and metric computation. It supports heterogeneous detector families under different access settings, including black-box methods that use only generated outputs, gray-box methods that rely on probability-based signals, and white-box methods that exploit internal model signals. By bringing diverse tasks, models, and detectors into a shared framework, OpenHalDet enables controlled comparison and provides a systematic view of how different detection paradigms behave in LLM applications. We release OpenHalDet as an open and extensible codebase to facilitate reproducible evaluation and future development of hallucination detection methods. The code and datasets are available at https://github.com/Nellie179/Hallucination-Detection.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

OpenHalDet offers a standardized benchmark and codebase for hallucination detection, but the abstract supplies no results showing whether the standardization actually improves comparisons.

The paper's core move is releasing OpenHalDet, a benchmark that tries to fix inconsistent setups in hallucination detection work by locking down the full pipeline from prompt construction through response generation, truthfulness labeling, detector scoring, and metrics. It explicitly supports black-box, gray-box, and white-box detectors in one framework and covers more tasks and models than most prior evaluations.

What stands out is the practical engineering: they identify the reproducibility problems clearly and respond with an open codebase plus datasets. That kind of shared testbed can save time for groups that want to compare methods without rebuilding the evaluation stack each time.

The main limitation is that nothing in the provided text shows actual numbers, ablation checks, or validation of the annotation process. Without those, it's impossible to judge whether the standardized pipeline removes the inconsistencies it targets or whether the chosen tasks and models are representative enough. The claim that it enables controlled comparison rests on the design description alone.

This is aimed at researchers who run or evaluate hallucination detectors and need a common reference point. Someone working on LLM reliability or deployment tools could use the released code to run their own comparisons.

I would send it for peer review. The engineering goal is clear and the release lowers the barrier for others; referees can check the implementation details and any experiments that appear in the full version.

Referee Report

1 major / 1 minor

Summary. The manuscript introduces OpenHalDet, a unified benchmark for hallucination detection in LLMs. It standardizes the full evaluation pipeline (prompt construction, response generation, truthfulness annotation, detector scoring, and metrics) across diverse tasks and models, supporting black-box, gray-box, and white-box detectors, and releases an open extensible codebase and datasets to enable reproducible comparisons.

Significance. If the standardization is correctly implemented and the released resources are comprehensive, OpenHalDet could provide a valuable common platform for comparing hallucination detectors, reducing inconsistencies in prior work and supporting systematic analysis of detection paradigms across access settings.

major comments (1)

[Abstract] The central claim that the benchmark 'enables controlled comparison and provides a systematic view of how different detection paradigms behave' is not supported by any reported experiments, baseline results, or validation metrics in the provided description; without such evidence the utility of the standardization cannot be assessed.

minor comments (1)

The description of supported tasks, models, and specific detectors is high-level; adding an explicit table or section listing coverage would strengthen the claim of 'diverse generation scenarios'.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback. We address the single major comment below and indicate the planned revisions.

Referee: [Abstract] The central claim that the benchmark 'enables controlled comparison and provides a systematic view of how different detection paradigms behave' is not supported by any reported experiments, baseline results, or validation metrics in the provided description; without such evidence the utility of the standardization cannot be assessed.

Authors: We agree that the abstract phrasing should be tightened to avoid overstating what is demonstrated. The manuscript's core contribution is the standardized pipeline and released codebase that make controlled comparisons possible across black-, gray-, and white-box detectors; the paper itself contains initial cross-detector evaluations on the included tasks and models. We will revise the abstract to state that OpenHalDet supplies the common infrastructure and datasets for such comparisons, and we will add an explicit reference to the baseline results already present in the experimental section so that the claim is directly supported by reported metrics.

Revision: yes

Circularity Check

0 steps flagged

No significant circularity

The paper introduces OpenHalDet as a standardized benchmark for hallucination detection, focusing on unifying evaluation pipelines across tasks, models, and detector types (black/gray/white-box). No derivation chain, mathematical predictions, fitted parameters, or first-principles results are claimed or present. The central contribution is the benchmark definition, standardization of configs, and code release itself, with no self-definitional reductions, fitted inputs renamed as predictions, or load-bearing self-citations that collapse the argument. The abstract and description describe an engineering solution to inconsistent evaluations without internal circular structure.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Benchmark paper; central claim rests on the premise that standardization improves comparability. No free parameters, mathematical axioms, or new entities are introduced.

pith-pipeline@v0.9.1-grok · 5783 in / 981 out tokens · 41482 ms · 2026-06-27T21:49:04.058601+00:00 · methodology


2048

[1] Wayne Xin Zhao, Kun Zhou, Junyi Li, Tianyi Tang, Xiaolei Wang, Yupeng Hou, Yingqian Min, Beichen Zhang, Junjie Zhang, Zican Dong, Yifan Du, Chen Yang, Yushuo Chen, Zhipeng Chen, Jinhao Jiang, Ruiyang Ren, Yifan Li, Xinyu Tang, Zikang Liu, Peiyu Liu, Jian-Yun Nie, and Ji-Rong Wen. A survey of large language models, 2026.

[2] Litian Liu, Reza Pourreza, Sunny Panchal, Apratim Bhattacharyya, Ya-Qin Zhang, and Roland Memisevic. Enhancing hallucination detection through noise injection. CoRR, abs/2502.03799, 2025.

[3] Seongheon Park, Xuefeng Du, Min-Hsuan Yeh, Haobo Wang, and Yixuan Li. Steer LLM latents for hallucination detection. In Forty-second International Conference on Machine Learning, 2025.

[4] Mandar Joshi, Eunsol Choi, Daniel S Weld, and Luke Zettlemoyer. TriviaQA: A large scale distantly supervised challenge dataset for reading comprehension. ACL, 2017.

[5] Stephanie Lin, Jacob Hilton, and Owain Evans. TruthfulQA: Measuring how models mimic human falsehoods. ACL, 2022.

[6] Emma Harvey, Allison Koenecke, and René F. Kizilcec. "Don’t forget the teachers": Towards an educator-centered understanding of harms from large language models in education. In Proceedings of the 2025 CHI Conference on Human Factors in Computing Systems, pages 1–19, 2025.

[7] Dimitri Roustan and François Bastardot. The clinicians’ guide to large language models: A general perspective with a focus on hallucinations. Interactive Journal of Medical Research, 14:e59823, 2025.

[8] Haoqiang Kang and Xiao-Yang Liu. Deficiency of large language models in finance: An empirical examination of hallucination. arXiv preprint arXiv:2311.15548, 2023.

[9] Joshua Maynez, Shashi Narayan, Bernd Bohnet, and Ryan McDonald. On faithfulness and factuality in abstractive summarization. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 1906–1919, 2020.

[10] Adam Tauman Kalai and Santosh S. Vempala. Calibrated language models must hallucinate. In Proceedings of the 56th Annual ACM Symposium on Theory of Computing, STOC 2024, pages 160–171, New York, NY, USA, 2024. Association for Computing Machinery.

[11] Sourav Banerjee, Ayushi Agarwal, and Saloni Singla. LLMs will always hallucinate, and we need to live with this. In Intelligent Systems Conference, pages 624–648. Springer, 2025.

[12] Adam Tauman Kalai, Ofir Nachum, Santosh S Vempala, and Edwin Zhang. Why language models hallucinate. arXiv preprint arXiv:2509.04664, 2025.

[13] Adam Tauman Kalai and Santosh S. Vempala. Calibrated language models must hallucinate. In Proceedings of the 56th Annual ACM Symposium on Theory of Computing, 2024.

[14] Alkis Kalavasis, Anay Mehrotra, and Grigoris Velegkas. On the limits of language generation: Trade-offs between hallucination and mode-collapse. In Proceedings of the 57th Annual ACM Symposium on Theory of Computing, 2025.

[15] Jie Ren, Jiaming Luo, Yao Zhao, Kundan Krishna, Mohammad Saleh, Balaji Lakshminarayanan, and Peter J Liu. Out-of-distribution detection and selective generation for conditional language models. arXiv preprint arXiv:2209.15558, 2022.

[16] Junyi Li, Xiaoxue Cheng, Wayne Xin Zhao, Jian-Yun Nie, and Ji-Rong Wen. HaluEval: A large-scale hallucination evaluation benchmark for large language models. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 6449–6464, 2023.

[17] Potsawee Manakul, Adian Liusie, and Mark Gales. SelfCheckGPT: Zero-resource black-box hallucination detection for generative large language models. In The 2023 Conference on Empirical Methods in Natural Language Processing, 2023.

[18] Sebastian Farquhar, Jannik Kossen, Lorenz Kuhn, and Yarin Gal. Detecting hallucinations in large language models using semantic entropy. Nature, 630:625–630, 2024.

[19] Lei Huang, Weijiang Yu, Weitao Ma, Weihong Zhong, Zhangyin Feng, Haotian Wang, Qianglong Chen, Weihua Peng, Xiaocheng Feng, Bing Qin, and Ting Liu. A survey on hallucination in large language models: Principles, taxonomy, challenges, and open questions. ACM Trans. Inf. Syst., 2025.

[20] Yue Zhang, Yafu Li, Leyang Cui, Deng Cai, Lemao Liu, Tingchen Fu, Xinting Huang, Enbo Zhao, Yu Zhang, Yulong Chen, Longyue Wang, Anh Tuan Luu, Wei Bi, Freda Shi, and Shuming Shi. Siren’s song in the AI ocean: A survey on hallucination in large language models. Computational Linguistics, 2025.

[21] Stephanie C. Lin, Jacob Hilton, and Owain Evans. Teaching models to express their uncertainty in words. Trans. Mach. Learn. Res., 2022.

[22] Zhen Lin, Shubhendu Trivedi, and Jimeng Sun. Generating with confidence: Uncertainty quantification for black-box large language models. Transactions on Machine Learning Research, 2024.

[23] Jie Ren, Jiaming Luo, Yao Zhao, Kundan Krishna, Mohammad Saleh, Balaji Lakshminarayanan, and Peter J Liu. Out-of-distribution detection and selective generation for conditional language models. In The Eleventh International Conference on Learning Representations, 2023.

[24] Andrey Malinin and Mark Gales. Uncertainty estimation in autoregressive structured prediction. In International Conference on Learning Representations, 2021.

[25] Jinhao Duan, Hao Cheng, Shiqi Wang, Alex Zavalny, Chenan Wang, Renjing Xu, Bhavya Kailkhura, and Kaidi Xu. Shifting attention to relevance: Towards the predictive uncertainty quantification of free-form large language models. In Lun-Wei Ku, Andre Martins, and Vivek Srikumar, editors, Proceedings of the 62nd Annual Meeting of the Association for Computati...

[26] Chao Chen, Kai Liu, Ze Chen, Yi Gu, Yue Wu, Mingyuan Tao, Zhihang Fu, and Jieping Ye. INSIDE: LLMs’ internal states retain the power of hallucination detection. In The Twelfth International Conference on Learning Representations, 2024.

[27] Collin Burns, Haotian Ye, Dan Klein, and Jacob Steinhardt. Discovering latent knowledge in language models without supervision. In The Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1-5, 2023. OpenReview.net, 2023.

[28] Amos Azaria and Tom Mitchell. The internal state of an LLM knows when it’s lying. In The 2023 Conference on Empirical Methods in Natural Language Processing, 2023.

[29] Yasin Abbasi Yadkori, Ilja Kuzborskij, David Stutz, András György, Adam Fisch, Arnaud Doucet, Iuliya Beloshapka, Wei-Hung Weng, Yao-Yuan Yang, Csaba Szepesvári, et al. Mitigating LLM hallucinations via conformal abstention. arXiv preprint arXiv:2405.01563, 2024.

[30] Elham Asgari, Nina Montaña-Brown, Magda Dubois, Saleh Khalil, Jasmine Balloch, Joshua Au Yeung, and Dominic Pimenta. A framework to assess clinical safety and hallucination rates of LLMs for medical text summarisation. npj Digital Medicine, 8(1):274, 2025.

[31] ZhongXiang Sun, Xiaoxue Zang, Kai Zheng, Jun Xu, Xiao Zhang, Weijie Yu, Yang Song, and Han Li. ReDeEP: Detecting hallucination in retrieval-augmented generation via mechanistic interpretability. In The Thirteenth International Conference on Learning Representations, 2025.

[32] Valentin Noël. Spectral guardrails for agents in the wild: Detecting tool use hallucinations via attention topology. arXiv preprint arXiv:2602.08082, 2026.

[33] Denis Janiak, Jakub Binkowski, Albert Sawczyn, Bogdan Gabrys, Ravid Shwartz-Ziv, and Tomasz Kajdanowicz. The illusion of progress: Re-evaluating hallucination detection in LLMs. In Conference on Empirical Methods in Natural Language Processing, 2025.

[34] Yejin Bang, Ziwei Ji, Alan Schelten, Anthony Hartshorn, Tara Fowler, Cheng Zhang, Nicola Cancedda, and Pascale Fung. HalluLens: LLM hallucination benchmark. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), July 2025.

[35] Deanna Emery, Michael Goitia, Freddie Vargus, and Iulia Neagu. Hallumix: A task-agnostic, multi-domain benchmark for real-world hallucination detection. 2025.

[36] Zhenliang Zhang, Xinyu Hu, Huixuan Zhang, Junzhe Zhang, and Xiaojun Wan. ICR probe: Tracking hidden state dynamics for reliable hallucination detection in LLMs. In Wanxiang Che, Joyce Nabende, Ekaterina Shutova, and Mohammad Taher Pilehvar, editors, Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Pape...

[37] Atharva Kulkarni*, Yuan Zhang, Joel Ruben Antony Moniz*, Xiou Ge, Bo-Hsiang Tseng, Dhivya Piraviperumal, Swabha Swayamdipta, and Hong Yu. Evaluating evaluation metrics — the mirage of hallucination detection. In EMNLP, 2025.

[38] Ashok Urlana, Gopichand Kanumolu, Charaka Vinayak Kumar, Bala Mallikarjunarao Garlapati, and Rahul Mishra. HalluCounter: Reference-free LLM hallucination detection in the wild! In Kentaro Inui, Sakriani Sakti, Haofen Wang, Derek F. Wong, Pushpak Bhattacharyya, Biplab Banerjee, Asif Ekbal, Tanmoy Chakraborty, and Dhirendra Pratap Singh, editors, Proceedings...

[39] Xuefeng Du, Chaowei Xiao, and Yixuan Li. Haloscope: Harnessing unlabeled LLM generations for hallucination detection. In The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024.

[40] Yongxin Deng, Zhen Fang, Sharon Li, and Ling Chen. Beyond in-domain detection: Spikescore for cross-domain hallucination detection. In The Fourteenth International Conference on Learning Representations, 2026.

[41] Min-Hsuan Yeh, Max Kamachee, Seongheon Park, and Yixuan Li. Halluentity: Benchmarking and understanding entity-level hallucination detection. Transactions on Machine Learning Research, 2025.

[42] Sewon Min, Kalpesh Krishna, Xinxi Lyu, Mike Lewis, Wen-tau Yih, Pang Wei Koh, Mohit Iyyer, Luke Zettlemoyer, and Hannaneh Hajishirzi. FActScore: Fine-grained atomic evaluation of factual precision in long form text generation. In EMNLP, 2023.

[43] Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. Think you have solved question answering? Try ARC, the AI2 reasoning challenge. ArXiv, abs/1803.05457, 2018.

[44] Alon Talmor, Jonathan Herzig, Nicholas Lourie, and Jonathan Berant. CommonsenseQA: A question answering challenge targeting commonsense knowledge. In Jill Burstein, Christy Doran, and Thamar Solorio, editors, Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volum...

[45] Pranav Rajpurkar, Robin Jia, and Percy Liang. Know what you don’t know: Unanswerable questions for SQuAD. In Iryna Gurevych and Yusuke Miyao, editors, Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), Melbourne, Australia, July 2018. Association for Computational Linguistics.

[46] Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William Cohen, Ruslan Salakhutdinov, and Christopher D. Manning. HotpotQA: A dataset for diverse, explainable multi-hop question answering. In Ellen Riloff, David Chiang, Julia Hockenmaier, and Jun’ichi Tsujii, editors, Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processi...

[47] Siva Reddy, Danqi Chen, and Christopher D. Manning. CoQA: A conversational question answering challenge. Transactions of the Association for Computational Linguistics, 7, 2019.

[48] Cheng Niu, Yuanhao Wu, Juno Zhu, Siliang Xu, KaShun Shum, Randy Zhong, Juntong Song, and Tong Zhang. RAGTruth: A hallucination corpus for developing trustworthy retrieval-augmented language models. In Lun-Wei Ku, Andre Martins, and Vivek Srikumar, editors, Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: ...

[49] Shashi Narayan, Shay B. Cohen, and Mirella Lapata. Don’t give me the details, just the summary! Topic-aware convolutional neural networks for extreme summarization. In Ellen Riloff, David Chiang, Julia Hockenmaier, and Jun’ichi Tsujii, editors, Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, Octobe...

[50] Karl Cobbe, Vineet Kosaraju, Mo Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. Training verifiers to solve math word problems. ArXiv, abs/2110.14168, 2021.

[51] Arkil Patel, Satwik Bhattamishra, and Navin Goyal. Are NLP models really able to solve simple math word problems? In Kristina Toutanova, Anna Rumshisky, Luke Zettlemoyer, Dilek Hakkani-Tur, Iz Beltagy, Steven Bethard, Ryan Cotterell, Tanmoy Chakraborty, and Yichao Zhou, editors, Proceedings of the 2021 Conference of the North American Chapter of the Associ...

[52] Association for Computational Linguistics

[53] Wenhu Chen, Ming Yin, Max Ku, Pan Lu, Yixin Wan, Xueguang Ma, Jianyu Xu, Xinyi Wang, and Tony Xia. TheoremQA: A theorem-driven question answering dataset. In Houda Bouamor, Juan Pino, and Kalika Bali, editors, Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, Singapore, December 2023. Association for Computational Linguistics.

[54] Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Pondé, Jared Kaplan, Harrison Edwards, Yura Burda, Nicholas Joseph, Greg Brockman, Alex Ray, Raul Puri, Gretchen Krueger, Michael Petrov, Heidy Khlaaf, Girish Sastry, Pamela Mishkin, Brooke Chan, Scott Gray, Nick Ryder, Mikhail Pavlov, Alethea Power, Lukasz Kaiser, Mo Bavarian, Clemens Winter, Phi...

[55] Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, Carrie J. Cai, Michael Terry, Quoc V. Le, and Charles Sutton. Program synthesis with large language models. ArXiv, 2021.

[56] Jianguo Zhang, Tian Lan, Ming Zhu, Zuxin Liu, Thai Hoang, Shirley Kokane, Weiran Yao, Juntao Tan, Zhiwei Liu, Yihao Feng, Juan Carlos Niebles, Shelby Heinecke, Huan Wang, Silvio Savarese, and Caiming Xiong. xLAM: A family of large action models to empower AI agent systems. In Luis Chiruzzo, Alan Ritter, and Lu Wang, editors, Proceedings of the 2025 Conf...

[57] Lucas Bandarkar, Davis Liang, Benjamin Muller, Mikel Artetxe, Satya Narayan Shukla, Donald Husa, Naman Goyal, Abhinandan Krishnan, Luke Zettlemoyer, and Madian Khabsa. The belebele benchmark: a parallel reading comprehension dataset in 122 language variants. In Lun-Wei Ku, Andre Martins, and Vivek Srikumar, editors, Proceedings of the 62nd Annual Meeting o...

[58] Yiming Wang, Pei Zhang, Baosong Yang, Derek F. Wong, and Rui Wang. Latent space chain-of-embedding enables output-free LLM self-evaluation. In The Thirteenth International Conference on Learning Representations, 2025.

[59] Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric P. Xing, Hao Zhang, Joseph E. Gonzalez, and Ion Stoica. Judging LLM-as-a-judge with MT-Bench and Chatbot Arena. In Proceedings of the 37th International Conference on Neural Information Processing Systems, NIPS ’23, Red Hook, NY, USA...

[60] Saurav Kadavath, Tom Conerly, Amanda Askell, Thomas Henighan, Dawn Drain, Ethan Perez, Nicholas Schiefer, Zachary Dodds, Nova Dassarma, Eli Tran-Johnson, Scott Johnston, Sheer El-Showk, Andy Jones, Nelson Elhage, Tristan Hume, Anna Chen, Yuntao Bai, Sam Bowman, Stanislav Fort, Deep Ganguli, Danny Hernandez, Josh Jacobson, John Kernion, Shauna Kravec, Lian...

[61] Weihang Su, Changyue Wang, Qingyao Ai, Yiran Hu, Zhijing Wu, Yujia Zhou, and Yiqun Liu. Unsupervised real-time hallucination detection based on the internal states of large language models. In Lun-Wei Ku, Andre Martins, and Vivek Srikumar, editors, Findings of the Association for Computational Linguistics: ACL 2024, Bangkok, Thailand, August 2024. Associat...

[62] Jannik Kossen, Jiatong Han, Muhammed Razzak, Lisa Schut, Shreshth A Malik, and Yarin Gal. Semantic entropy probes: Robust and cheap hallucination detection in LLMs, 2025.

[63] Fujie Zhang, Peiqi Yu, Biao Yi, Baolei Zhang, Tong Li, and Zheli Liu. Prompt-guided internal states for hallucination detection of large language models. In Wanxiang Che, Joyce Nabende, Ekaterina Shutova, and Mohammad Taher Pilehvar, editors, Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pa...

[64] Llama Team. The Llama 3 herd of models. CoRR, abs/2407.21783, 2024.

[65] Qwen Team. Qwen3 technical report. CoRR, abs/2505.09388, 2025.

Appendix Contents
A Backbone LLMs
B Prior Detector Settings
C Dataset Processing, Prompt Construction, and Generation Pipeline
C.1 Coverage Compared with Representative Hallucination Benchmarks
C.2 Dataset Processing and Unified Schema ...

[66] If the Model’s answer aligns with any of the Acceptable Truths, output correct.

[67] If the Model’s answer aligns with any Known Traps or introduces fabricated facts, output hallucination.

[68] If the Model explicitly states it does not know the answer, output abstention.

Provide a brief reasoning, then select the category.

User input format:
Context: <optional context>
Question: <question>
Acceptable Truths: <list of acceptable answers>
Known Traps: <optional list of known incorrect answers>
Model Answer: <generated response>

Required structured...

2048
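
The judge rules and user input format listed above can be illustrated with a short Python sketch. This is not the benchmark's implementation: the function name `build_judge_input`, the `"; "` separator used to join list items, and the choice to omit optional fields when they are absent are assumptions made for illustration. Only the field names and the three labels come from the text above. The required structured output is truncated in the source, so it is not modeled here.

```python
from typing import Optional, Sequence

# The three categories named in the judge rules above.
JUDGE_LABELS = ("correct", "hallucination", "abstention")


def build_judge_input(
    question: str,
    acceptable_truths: Sequence[str],
    model_answer: str,
    context: Optional[str] = None,
    known_traps: Optional[Sequence[str]] = None,
) -> str:
    """Assemble the user message in the field order given by the input format.

    Hypothetical helper: the "; " separator and the omission of empty
    optional fields are illustrative assumptions, not taken from the paper.
    """
    lines = []
    if context:  # Context is optional in the stated format.
        lines.append(f"Context: {context}")
    lines.append(f"Question: {question}")
    lines.append("Acceptable Truths: " + "; ".join(acceptable_truths))
    if known_traps:  # Known Traps is optional in the stated format.
        lines.append("Known Traps: " + "; ".join(known_traps))
    lines.append(f"Model Answer: {model_answer}")
    return "\n".join(lines)


if __name__ == "__main__":
    print(
        build_judge_input(
            question="What is the capital of France?",
            acceptable_truths=["Paris"],
            model_answer="The capital of France is Paris.",
            known_traps=["Lyon", "Marseille"],
        )
    )
```

Keeping the optional fields out of the message when they are empty avoids presenting the judge with blank `Context:` or `Known Traps:` lines, though a fixed template with empty fields would be an equally reasonable design.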