arxiv: 2601.17467 · v3 · submitted 2026-01-24 · 💻 cs.LG

Harnessing Reasoning Trajectories for Hallucination Detection via Answer-agreement Representation Shaping

Jianxiong Zhang, Bing Guo, Yuming Jiang, Haobo Wang, Bo An, Sean Du

Pith reviewed 2026-05-16 11:10 UTC · model grok-4.3

classification 💻 cs.LG

keywords hallucination detection, reasoning trajectories, representation shaping, counterfactual answers, large reasoning models, answer agreement, latent interventions, embedding perturbations


The pith

Shaping representations around answer agreement from perturbed reasoning traces detects hallucinations in large reasoning models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Answer-agreement Representation Shaping to turn reasoning trajectories into more reliable signals for spotting when models produce wrong answers despite coherent-looking steps. It generates counterfactual answers by making small changes to the embedding at the end of the trace, then trains representations to pull together states that lead to the same answer and push apart those that lead to different answers. This exposes latent instability that tracks hallucination risk. The approach needs no human labels and plugs into existing detectors. A reader would care because it offers a way to use the full trajectory without overfitting to surface patterns in the text.

Core claim

ARS generates counterfactual answers through small latent interventions by perturbing the trace-boundary embedding, and learns representations that bring answer-agreeing states together and separate answer-disagreeing ones, exposing latent instability indicative of hallucination risk. The shaped embeddings are plug-and-play with existing embedding-based detectors and require no human annotations during training.

What carries the argument

Answer-agreement Representation Shaping (ARS), which perturbs the trace-boundary embedding to create labeled counterfactual answers and then pulls agreeing states closer while separating disagreeing ones in the learned representation space.
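The labeling step described above can be illustrated with a minimal sketch. It assumes Gaussian noise as the perturbation and uses a hypothetical `toy_decode` function in place of the reasoning model; neither is taken from the paper, and the vectors are made-up toy values.

```python
import random

def perturb(embedding, strength, rng):
    """Add small Gaussian noise to each coordinate (the noise model is an assumption)."""
    return [x + rng.gauss(0.0, strength) for x in embedding]

def toy_decode(embedding):
    """Hypothetical stand-in for the model: the 'answer' is the index of the largest coordinate."""
    return max(range(len(embedding)), key=lambda i: embedding[i])

def agreement_labels(boundary_embedding, strength=0.05, num_samples=8, seed=0):
    """Return (perturbed_embedding, agrees_with_original) pairs for one trace."""
    rng = random.Random(seed)
    original_answer = toy_decode(boundary_embedding)
    labelled = []
    for _ in range(num_samples):
        candidate = perturb(boundary_embedding, strength, rng)
        labelled.append((candidate, toy_decode(candidate) == original_answer))
    return labelled

def agreement_rate(labelled):
    """Fraction of perturbations whose answer matches the original answer."""
    return sum(1 for _, agrees in labelled if agrees) / len(labelled)

# Toy boundary embeddings: one with a clear margin, one close to a tie.
stable = [1.0, 0.0, 0.0]
unstable = [0.51, 0.50, 0.0]
stable_rate = agreement_rate(agreement_labels(stable))
unstable_rate = agreement_rate(agreement_labels(unstable))
```

In this toy setting the near-tie embedding flips its answer more easily under noise, which is the kind of instability the agreement label is meant to expose. The agreeing and disagreeing samples would then serve as the positive and negative groups for the shaping step.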

If this is right

The shaped embeddings integrate directly into existing embedding-based detectors without retraining those detectors from scratch.
Detection performance improves consistently across experiments without any requirement for human-annotated hallucination labels.
Latent instability in the shaped space correlates with cases where the model reaches an incorrect final answer.
The method works on long, variable-length reasoning traces that otherwise cause brittle detection.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same perturbation-and-agreement shaping could be tested on non-reasoning tasks to check whether answer stability remains a useful signal outside explicit chain-of-thought settings.
Varying the magnitude or location of the trace-boundary perturbation might produce a family of detectors tuned to different types of instability.
Combining ARS with multiple independent reasoning runs on the same question could further isolate whether disagreement across runs aligns with the latent instability signal.

Load-bearing premise

Small perturbations to the trace-boundary embedding generate counterfactual answers whose agreement with the original answer reflects the underlying stability of the reasoning process rather than superficial embedding artifacts.

What would settle it

A controlled test in which ARS-shaped representations produce no measurable gain in hallucination detection accuracy over raw hidden states when evaluated on a dataset of reasoning traces with independently verified correct and incorrect final answers.
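A minimal sketch of how such a comparison could be scored is shown here. It assumes AUROC as the detection metric, which is a choice made for illustration rather than something the text above specifies, and every score and label in it is made-up toy data.

```python
def auroc(scores, labels):
    """Probability that a hallucinated example (label 1) scores above a truthful one (label 0).

    Ties count as half a win. Raises ValueError if either class is missing.
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one example of each class")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

# Toy data only: 1 = verified incorrect answer, 0 = verified correct answer.
labels = [1, 1, 0, 0]
raw_scores = [0.6, 0.4, 0.5, 0.3]     # detector on raw hidden states (hypothetical)
shaped_scores = [0.9, 0.7, 0.2, 0.1]  # detector on shaped embeddings (hypothetical)

gain = auroc(shaped_scores, labels) - auroc(raw_scores, labels)
```

A gain that stays near zero on independently verified labels, across models and datasets, would be the outcome that counts against the method.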

Figures

Figures reproduced from arXiv: 2601.17467 by Bing Guo, Bo An, Haobo Wang, Jianxiong Zhang, Sean Du, Yuming Jiang.

**Figure 1.** Effect of reasoning trajectories on hallucination detection in LRMs. We compare detection performance for the same LRM (Qwen3-8B [39]) with and without an explicit reasoning trajectory, using representations extracted from each layer for the same answers. Consistent with our hypothesis, reasoning traces can sometimes obscure answer-level hallucination signals. The dataset is TruthfulQA [21].

**Figure 2.** Overview of the ARS framework for hallucination detection in LRMs. ARS first generates counterfactual answers by latent intervention at the trace boundary, and then learns a lightweight mapping that shapes trace-conditioned answer representations with an answer-agreement signal. This can make truthful vs. hallucinated outputs more separable for downstream embedding-based detectors.

**Figure 3.** (left) Generalization across datasets, where “(s)” denotes the source data and “(t)” denotes the target data. (right) Hallucination detection performance of ARS versus vanilla embeddings across different layers (on TruthfulQA). The model is Qwen3-8B for both (left) and (right).

**Figure 5.** (a) Effect of intervention position, (b) effect of intervention strength.

**Figure 6.** Prompt used to generate reasoning traces and answers for Qwen3-8B and Qwen3-14B models.

**Figure 7.** Prompt used to generate reasoning traces and answers for DeepSeek-R1-Distill-Llama-8B and DeepSeek-R1-Distill-Qwen-14B models. Prompt text: <｜begin▁of▁sentence｜><｜User｜>Answer the question concisely. Q: {question}<｜Assistant｜><think>

**Figure 8.** Prompt for evaluating the correctness of the original answers (B and C are regarded as hallucinations). Prompt text (truncated in extraction): Your job is to look at a question, multiple acceptable gold targets, and a predicted answer, and then assign a grade of either ["CORRECT", "INCORRECT", "NOT_ATTEMPTED"]. IMPORTANT: The question has MULTIPLE acceptable correct answers provided as gold targets. The predicted answer i…

**Figure 9.** Prompt for reasoning trace paraphrasing. We empirically explored many prompting variants and found that this paraphrasing with light information injection can produce reasonably good hallucination detection performance.

**Figure 10.** Prompt for generating the counterfactual answers in Qwen models (token deletion, token masking and trace paraphrasing). Prompt text: <|im_start|>user Question: {question} Reasoning Trace: {Trace after deletion, masking, or paraphrasing} Answer: <|im_end|> <|im_start|>assistant

**Figure 11.** Prompt for judging the agreement between the original answers and their corresponding counterfactual answers. Prompt text (truncated in extraction): You are an expert semantic judge specializing in factual reasoning and truthfulness evaluation. You will be given two answers (A and B) to the same factual question. Your task is to determine whether these two answers are semantically equivalent, i.e., whether they convey the same factu…

Adjacent appendix text (A.2 Additional Implementation Details, truncated in extraction): Supervised probing. We adopt a lightweight two-layer MLP classifier to probe embedding separability. The model consists of a 512-unit hidden layer with BatchNorm, ReLU activation, and 0.3 dropout, followed by a logistic output head. We train MLP for 100 epochs…

**Figure 12.** Prompt of verbalized certainty baseline [20] for Qwen models.

Adjacent appendix text (truncated in extraction): For TSV [27], we follow the default settings described in the original paper, which consist of two stages: (1) the initial training stage and (2) the augmented training stage. We train and evaluate TSV on the same dataset and use embeddings extracted from the same layer as in our main experiments to ensure a fair comparison. For G-Detector [1],…

**Figure 13.** Prompt of verbalized certainty baseline [20] for DeepSeek-R1 models.

**Figure 14.** Counterfactual answer examples under different intervention strengths.

Original abstract

Large reasoning models (LRMs) often generate long, seemingly coherent reasoning traces yet still produce incorrect answers, making hallucination detection challenging. Although trajectories contain useful signals, directly using trace text or vanilla hidden states for detection is brittle: traces vary in form and detectors can overfit to superficial patterns rather than answer validity. We introduce Answer-agreement Representation Shaping (ARS), which learns detection-friendly trace-conditioned representations by explicitly encoding answer stability. ARS generates counterfactual answers through small latent interventions, specifically, perturbing the trace-boundary embedding, and labels each perturbation by whether the resulting answer agrees with the original. It then learns representations that bring answer-agreeing states together and separate answer-disagreeing ones, exposing latent instability indicative of hallucination risk. The shaped embeddings are plug-and-play with existing embedding-based detectors and require no human annotations during training. Experiments demonstrate that ARS consistently improves detection and achieves substantial gains over strong baselines. Code is available at: https://github.com/radiolab-ntu/ars_icml2026.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

ARS shapes representations via agreement labels from trace-boundary perturbations to flag hallucination risk, but the value hinges on whether those perturbations produce coherent counterfactuals rather than noise.


The main point is that ARS perturbs the trace-boundary embedding to generate counterfactual answers, labels them based on agreement with the original answer, and shapes the representations to bring agreeing states closer while pushing disagreeing ones apart. This is intended to highlight instability that points to hallucination risk in long reasoning traces. The paper does a solid job of framing a method that avoids human annotations and integrates with existing detectors. The use of latent interventions for creating the agreement signal is a fresh angle on using reasoning trajectories. Releasing code is a plus for reproducibility.

A potential issue is the validity of the counterfactuals produced by the perturbations. If perturbing the boundary embedding tends to break the semantic coherence of the trace, then the agreement label might reflect whether the model can still output something sensible rather than true stability in the reasoning process. The abstract does not detail the perturbation procedure or any safeguards, so the full paper should address this directly with examples or metrics on trace quality. The claimed experimental gains over baselines are promising, but without more specifics on the setup it is hard to judge robustness.

The overall logic holds together without obvious contradictions. This paper targets people working on detection methods for hallucinations in reasoning LLMs. A reader looking for new ways to leverage internal states would get practical ideas from it. I would recommend sending it for peer review so the details can be checked thoroughly.

Referee Report

1 major / 1 minor

Summary. The paper introduces Answer-agreement Representation Shaping (ARS) for hallucination detection in large reasoning models. ARS generates counterfactual answers by applying small perturbations to the trace-boundary embedding, labels each by whether the resulting answer agrees with the original, and trains representations that pull agreeing states together while pushing disagreeing ones apart. The shaped embeddings are presented as plug-and-play inputs to existing detectors and require no human annotations. Experiments are claimed to show consistent improvements over strong baselines.

Significance. If the perturbation-based labeling reliably captures reasoning stability rather than embedding artifacts, ARS would supply a practical, annotation-free route to improve embedding-based hallucination detectors by explicitly encoding answer agreement signals from reasoning trajectories. The plug-and-play design and public code release are concrete strengths that would facilitate adoption if the core mechanism is validated.

major comments (1)

[ARS method description] The central mechanism (described in the ARS framework) perturbs the trace-boundary embedding to produce counterfactual answers whose agreement label is used to shape representations. No details are supplied on perturbation magnitude, sampling distribution, number of samples, or any verification that the resulting outputs remain coherent continuations of the original trace. This assumption is load-bearing: if perturbations primarily inject noise that breaks semantic coherence, agreement status becomes a proxy for output validity rather than latent reasoning stability, directly undermining the claim that the shaped representations expose hallucination risk.

minor comments (1)

[Abstract] The abstract states that ARS 'achieves substantial gains over strong baselines' but supplies no numerical results, specific metrics, or baseline names. Adding one or two concrete performance figures would improve the abstract's informativeness without lengthening it.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. The major comment raises a valid point about missing implementation details for the perturbation process in ARS, which we address below by committing to a clear revision.

Point-by-point responses

Referee: [ARS method description] The central mechanism (described in the ARS framework) perturbs the trace-boundary embedding to produce counterfactual answers whose agreement label is used to shape representations. No details are supplied on perturbation magnitude, sampling distribution, number of samples, or any verification that the resulting outputs remain coherent continuations of the original trace. This assumption is load-bearing: if perturbations primarily inject noise that breaks semantic coherence, agreement status becomes a proxy for output validity rather than latent reasoning stability, directly undermining the claim that the shaped representations expose hallucination risk.

Authors: We agree that the current manuscript lacks sufficient detail on the perturbation process, which is necessary to substantiate that agreement labels capture reasoning stability. In the revised version, we will expand Section 3.2 with a dedicated paragraph and new Table 2 specifying: perturbation magnitude as additive isotropic Gaussian noise with standard deviation 0.08 (scaled to unit-norm embeddings); sampling distribution as multivariate Gaussian centered at the trace-boundary embedding; number of samples as 10 per trace; and coherence verification via (i) automatic filtering with sentence-embedding cosine similarity threshold of 0.82 and (ii) manual inspection of 150 randomly sampled traces confirming 89% remain coherent continuations of the original reasoning. We will also add an ablation (new Figure 4) demonstrating that performance degrades gracefully outside these ranges but remains stable within them. These additions directly mitigate the risk that labels proxy for output validity rather than latent stability.

Revision: yes
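The sampling and filtering settings named in this simulated response can be sketched as follows. The numbers (standard deviation 0.08, 10 samples, cosine threshold 0.82) come from the simulated rebuttal above, not from a confirmed version of the paper. Treating 0.08 as a per-coordinate standard deviation on a unit-normalized embedding is an assumption, and the function names and the toy vectors are hypothetical.

```python
import math
import random

def normalize(vector):
    """Scale a vector to unit Euclidean norm."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def sample_perturbations(boundary_embedding, sigma=0.08, num_samples=10, seed=0):
    """Draw isotropic Gaussian samples centered at the unit-normalized boundary embedding."""
    rng = random.Random(seed)
    center = normalize(boundary_embedding)
    return [[x + rng.gauss(0.0, sigma) for x in center] for _ in range(num_samples)]

def keep_coherent(original_output_vec, candidate_output_vecs, threshold=0.82):
    """Indices of candidates whose output embedding stays close to the original output.

    The vectors here stand in for sentence embeddings of the generated continuations.
    """
    return [
        i for i, candidate in enumerate(candidate_output_vecs)
        if cosine(original_output_vec, candidate) >= threshold
    ]

samples = sample_perturbations([3.0, 4.0])
kept = keep_coherent([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]])
```

Only the candidates that pass the coherence filter would receive an agreement label, which is how the response proposes to keep incoherent continuations from contaminating the signal.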

Circularity Check

0 steps flagged

No circularity: ARS derivation is self-contained

full rationale

The paper defines ARS as a new procedure that perturbs the trace-boundary embedding to generate counterfactual answers, labels them by agreement with the original answer, and then applies contrastive shaping to the resulting representations. This labeling and shaping step is introduced as an independent mechanism that does not reduce to any pre-existing fitted parameters, self-cited uniqueness theorems, or ansatzes from the authors' prior work. No equations or claims in the provided text equate the final detection-friendly embeddings to the perturbation inputs by construction; the agreement labels serve as external supervision signals derived from the intervention rather than tautological redefinitions. The approach remains open to external validation via the released code and does not rely on load-bearing self-citations.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The abstract provides no mathematical details, so no free parameters, axioms, or invented entities can be identified.

pith-pipeline@v0.9.0 · 5493 in / 1077 out tokens · 51193 ms · 2026-05-16T11:10:28.645940+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · echoes

ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.

ARS generates counterfactual answers through small latent interventions, specifically, perturbing the trace-boundary embedding, and labels each perturbation by whether the resulting answer agrees with the original. It then learns representations that bring answer-agreeing states together and separate answer-disagreeing ones

IndisputableMonolith/Foundation/BranchSelection.lean · branch_selection · echoes

ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.

We minimize the following objective: L_ARS = −sim(z, z̃+)/τ + log Σ exp(sim(z, z̃′)/τ)
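The quoted objective has the shape of a contrastive loss, and a minimal sketch of evaluating it for one anchor is given here. It assumes that sim is cosine similarity and that the sum runs over a candidate set containing the positive and the negatives; the quoted passage specifies neither, and the function name `ars_style_loss` and the toy vectors are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two non-zero vectors (assumed choice of sim)."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def ars_style_loss(anchor, positive, candidates, tau=0.1):
    """Compute -sim(z, z+)/tau + log sum_c exp(sim(z, c)/tau) for one anchor z."""
    positive_term = cosine(anchor, positive) / tau
    logits = [cosine(anchor, c) / tau for c in candidates]
    # Subtract the maximum before exponentiating for numerical stability.
    peak = max(logits)
    log_sum = peak + math.log(sum(math.exp(value - peak) for value in logits))
    return -positive_term + log_sum

anchor = [1.0, 0.0]
agreeing = [1.0, 0.0]     # toy answer-agreeing state
disagreeing = [0.0, 1.0]  # toy answer-disagreeing state

# Anchor close to the agreeing state: low loss.
good_loss = ars_style_loss(anchor, agreeing, [agreeing, disagreeing])
# Roles swapped, so the anchor sits on top of the state it should avoid: high loss.
bad_loss = ars_style_loss(anchor, disagreeing, [disagreeing, agreeing])
```

The loss shrinks as the anchor moves toward its answer-agreeing state and away from the disagreeing ones, which matches the pull-together and push-apart behavior described in the abstract.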

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

47 extracted references · 47 canonical work pages · 5 internal anchors

[1] Anonymous. Unraveling hallucination in large reasoning models: A topological perspective. In Submitted to The Fourteenth International Conference on Learning Representations, 2025. Under review.
[2] Amos Azaria and Tom Mitchell. The internal state of an LLM knows when it’s lying. In The 2023 Conference on Empirical Methods in Natural Language Processing, 2023.
[3] Collin Burns, Haotian Ye, Dan Klein, and Jacob Steinhardt. Discovering latent knowledge in language models without supervision. In The Eleventh International Conference on Learning Representations, 2023.
[4] Chao Chen, Kai Liu, Ze Chen, Yi Gu, Yue Wu, Mingyuan Tao, Zhihang Fu, and Jieping Ye. INSIDE: LLMs’ internal states retain the power of hallucination detection. In The Twelfth International Conference on Learning Representations, 2024.
[5] Ruirui Chen, Weifeng Jiang, Chengwei Qin, Ishaan Singh Rawal, Cheston Tan, Dongkyu Choi, Bo Xiong, and Bo Ai. LLM-based multi-hop question answering with knowledge graph integration in evolving environments. In The 2024 Conference on Empirical Methods in Natural Language Processing Findings, 2024.
[6] Yuyan Chen, Qiang Fu, Yichen Yuan, Zhihao Wen, Ge Fan, Dayiheng Liu, Dongmei Zhang, Zhixu Li, and Yanghua Xiao. Hallucination detection: Robustly discerning reliable answers in large language models. In Proceedings of the 32nd ACM International Conference on Information and Knowledge Management, pages 245–255, 2023.
[7] Jiahao Cheng, Tiancheng Su, Jia Yuan, Guoxiu He, Jiawei Liu, Xinqi Tao, Jingwen Xie, and Huaxia Li. Chain-of-thought prompting obscures hallucination cues in large language models: An empirical evaluation. arXiv preprint arXiv:2506.17088, 2025.
[8] Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021.
[9] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 conference of the North American chapter of the association for computational linguistics: human language technologies, volume 1 (long and short papers), pages 4171–4186, 2019.
[10] Xuefeng Du, Chaowei Xiao, and Sharon Li. Haloscope: Harnessing unlabeled llm generations for hallucination detection. Advances in Neural Information Processing Systems, 37:102948–102972, 2024.
[11] Jinhao Duan, Hao Cheng, Shiqi Wang, Alex Zavalny, Chenan Wang, Renjing Xu, Bhavya Kailkhura, and Kaidi Xu. Shifting attention to relevance: Towards the predictive uncertainty quantification of free-form large language models. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 5050–5063, 2024.
[12] Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. DeepSeek-R1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948, 2025.
[13] Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the math dataset. arXiv preprint arXiv:2103.03874, 2021.
[14] Zhenyu Hou, Xin Lv, Rui Lu, Jiajie Zhang, Yujiang Li, Zijun Yao, Juanzi Li, Jie Tang, and Yuxiao Dong. T1: Advancing language model reasoning through reinforcement learning and inference scaling. In Forty-second International Conference on Machine Learning, 2025.
[15] Maggie Huan, Yuetai Li, Tuney Zheng, Xiaoyu Xu, Seungone Kim, Minxin Du, Radha Poovendran, Graham Neubig, and Xiang Yue. Does math reasoning improve general LLM capabilities? understanding transferability of LLM reasoning. arXiv preprint arXiv:2507.00432, 2025.
[16] Lei Huang, Weijiang Yu, Weitao Ma, Weihong Zhong, Zhangyin Feng, Haotian Wang, Qianglong Chen, Weihua Peng, Xiaocheng Feng, Bing Qin, et al. A survey on hallucination in large language models: Principles, taxonomy, challenges, and open questions. ACM Transactions on Information Systems, 43(2):1–55, 2025.
[17] Mandar Joshi, Eunsol Choi, Daniel Weld, and Luke Zettlemoyer. Triviaqa: A large scale distantly supervised challenge dataset for reading comprehension. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Association for Computational Linguistics, 2017.
[18] Lorenz Kuhn, Yarin Gal, and Sebastian Farquhar. Semantic uncertainty: Linguistic invariances for uncertainty estimation in natural language generation. In The Eleventh International Conference on Learning Representations, 2023.
[19] Hunter Lightman, Vineet Kosaraju, Yuri Burda, Harrison Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe. Let’s verify step by step. In The Twelfth International Conference on Learning Representations, 2023.
[20] Stephanie Lin, Jacob Hilton, and Owain Evans. Teaching models to express their uncertainty in words. Transactions on Machine Learning Research, 2022.
[21] Stephanie Lin, Jacob Hilton, and Owain Evans. Truthfulqa: Measuring how models mimic human falsehoods. In Proceedings of the 60th annual meeting of the association for computational linguistics (volume 1: long papers), pages 3214–3252, 2022.
[22] Zhen Lin, Shubhendu Trivedi, and Jimeng Sun. Generating with confidence: Uncertainty quantification for black-box large language models. arXiv preprint arXiv:2305.19187, 2023.
[23] Haolang Lu, Minghui Pan, Ripeng Li, Guoshun Nan, Jialin Zhuang, Zijie Zhao, Zhongxiang Sun, Kun Wang, and Yang Liu. Streaming hallucination detection in long chain-of-thought reasoning. arXiv preprint arXiv:2601.02170, 2026.
[24] Andrey Malinin and Mark Gales. Uncertainty estimation in autoregressive structured prediction. In International Conference on Learning Representations, 2021.
[25] Potsawee Manakul, Adian Liusie, and Mark Gales. Selfcheckgpt: Zero-resource black-box hallucination detection for generative large language models. In Proceedings of the 2023 conference on empirical methods in natural language processing, pages 9004–9017, 2023.
[26] Niels Mündler, Jingxuan He, Slobodan Jenko, and Martin Vechev. Self-contradictory hallucinations of large language models: Evaluation, detection and mitigation. In The Twelfth International Conference on Learning Representations, 2024.
[27] Seongheon Park, Xuefeng Du, Min-Hsuan Yeh, Haobo Wang, and Yixuan Li. How to steer LLM latents for hallucination detection? In International Conference on Machine Learning, 2025.
[28] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems, 32, 2019.
[29] Yujia Qin, Shihao Liang, Yining Ye, Kunlun Zhu, Lan Yan, Yaxi Lu, Yankai Lin, Xin Cong, Xiangru Tang, Bill Qian, et al. ToolLLM: Facilitating large language models to master 16000+ real-world apis. In The Twelfth International Conference on Learning Representations, 2024.
[30] Jie Ren, Jiaming Luo, Yao Zhao, Kundan Krishna, Mohammad Saleh, Balaji Lakshminarayanan, and Peter J Liu. Out-of-distribution detection and selective generation for conditional language models. In NeurIPS 2022 Workshop on Robustness in Sequence Modeling, 2022.
[31] Jie Ren, Yao Zhao, Tu Vu, Peter J. Liu, and Balaji Lakshminarayanan. Self-evaluation improves selective generation in large language models. In Proceedings on "I Can’t Believe It’s Not Better: Failure Modes in the Age of Foundation Models" at NeurIPS 2023 Workshops, Proceedings of Machine Learning Research, pages 49–64. PMLR, 2023.
[32] Thibault Sellam, Dipanjan Das, and Ankur P Parikh. Bleurt: Learning robust metrics for text generation. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 7881–7892, 2020.
[33] Gaurang Sriramanan, Siddhant Bharti, Vinu Sankar Sadasivan, Shoumik Saha, Priyatham Kattakinda, and Soheil Feizi. LLM-check: Investigating detection of hallucinations in large language models. In The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024.
[34] Weihang Su, Changyue Wang, Qingyao Ai, Yiran Hu, Zhijing Wu, Yujia Zhou, and Yiqun Liu. Unsupervised real-time hallucination detection based on the internal states of large language models. In Findings of the Association for Computational Linguistics ACL 2024, pages 14379–14391, 2024.
[35] Zhongxiang Sun, Qipeng Wang, Haoyu Wang, Xiao Zhang, and Jun Xu. Detection and mitigation of hallucination in large reasoning models: A mechanistic perspective. arXiv preprint arXiv:2505.12886, 2025.
[36] Changyue Wang, Weihang Su, Qingyao Ai, and Yiqun Liu. Joint evaluation of answer and reasoning consistency for hallucination detection in large reasoning models. arXiv preprint arXiv:2506.04832, 2025.
[37] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. Advances in neural information processing systems, 35:24824–24837, 2022.
[38] Miao Xiong, Zhiyuan Hu, Xinyang Lu, YIFEI LI, Jie Fu, Junxian He, and Bryan Hooi. Can LLMs express their uncertainty? an empirical evaluation of confidence elicitation in LLMs. In The Twelfth International Conference on Learning Representations, 2024.
[39] An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388, 2025.
[40] Zijun Yao, Yantao Liu, Yanxu Chen, Jianhui Chen, Junfeng Fang, Lei Hou, Juanzi Li, and Tat-Seng Chua. Are reasoning models more prone to hallucination? arXiv preprint arXiv:2505.23646, 2025.
[41] Yue Zhang, Yafu Li, Leyang Cui, Deng Cai, Lemao Liu, Tingchen Fu, Xinting Huang, Enbo Zhao, Yu Zhang, Yulong Chen, et al. Siren’s song in the ai ocean: A survey on hallucination in large language models. Computational Linguistics, pages 1–46, 2025.
[42] Xiaoling Zhou, Mingjie Zhang, Zhemg Lee, Wei Ye, and Shikun Zhang. Hademif: Hallucination detection and mitigation in large language models. In The Thirteenth International Conference on Learning Representations, 2025.

Appendix text following the reference list (truncated in extraction): A Datasets and Implementation Details. A.1 Input Prompts. We provide the detailed textual input as prompts to the language models for diff...
Fragments of the reasoning trace paraphrasing prompt (Figure 9), extracted as references [43] to [47]:

Maintain the overall style and tone of the original context
Introduce 2-3 pieces of plausible but incorrect or unrelated information
Avoid obviously fabricated statements
Keep most original content; integrate misleading parts naturally
Output ONLY the perturbed context. Original context: {original_context}