TPA: Next Token Probability Attribution for Detecting Hallucinations in RAG

Anjin Liu; Guangquan Zhang; Jie Lu; Pengqian Lu

arxiv: 2512.07515 · v4 · submitted 2025-12-08 · 💻 cs.CL · cs.AI

Pith reviewed 2026-05-17 00:59 UTC · model grok-4.3

keywords: hallucination detection · retrieval-augmented generation · token probability attribution · part-of-speech analysis · large language models · RAG · model interpretability

The pith

Attributing each next-token probability to seven model sources and grouping by part-of-speech tags detects hallucinations in RAG outputs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper addresses the incomplete picture from prior work that viewed hallucinations mainly as conflicts between internal knowledge in feed-forward networks and retrieved context. It proposes TPA to decompose the probability of every generated token into contributions from the query, RAG context, past tokens, self token, feed-forward network, final layer normalization, and initial embedding. These decomposed scores are then summed within each part-of-speech category to expose unusual patterns. One such pattern is nouns drawing heavily from layer normalization instead of the provided context, which marks a hallucinated response. Experiments show this yields stronger detection than earlier methods.

Core claim

TPA mathematically attributes each token's probability to seven distinct sources: Query, RAG Context, Past Token, Self Token, FFN, Final LayerNorm, and Initial Embedding. Aggregating the attribution scores by Part-of-Speech tags reveals component-specific anomalies, such as nouns relying excessively on Final LayerNorm, that reliably indicate hallucinated responses in RAG.

What carries the argument

The seven-source decomposition of next-token probability, aggregated by POS tags to surface anomalies like noun over-reliance on LayerNorm.
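
The aggregation step described above can be illustrated with a minimal sketch. It assumes per-token attribution scores are already available as dictionaries keyed by source name, and that POS tags come from some external tagger. The function names, the dictionary layout, and the toy numbers are hypothetical and are not taken from the paper; only the seven source names follow the text.

```python
from collections import defaultdict

# The seven source names follow the paper; everything else here is hypothetical.
SOURCES = ("query", "rag_context", "past_token", "self_token",
           "ffn", "final_layernorm", "initial_embedding")


def aggregate_by_pos(token_attributions, pos_tags):
    """Sum per-token attribution scores within each POS tag.

    token_attributions: list of dicts mapping source name -> score, one per token.
    pos_tags: list of POS tags (e.g. "NOUN"), aligned with token_attributions.
    Returns {pos_tag: {source: summed_score}}.
    """
    if len(token_attributions) != len(pos_tags):
        raise ValueError("attributions and POS tags must be aligned")
    totals = defaultdict(lambda: dict.fromkeys(SOURCES, 0.0))
    for scores, tag in zip(token_attributions, pos_tags):
        for source in SOURCES:
            totals[tag][source] += scores.get(source, 0.0)
    return dict(totals)


def source_share(pos_totals, tag, source):
    """Fraction of a tag's total attribution that comes from one source."""
    row = pos_totals[tag]
    total = sum(row.values())
    return row[source] / total if total else 0.0


# Toy, made-up numbers for a three-token response.
attributions = [
    {"rag_context": 0.50, "ffn": 0.10, "final_layernorm": 0.05},
    {"rag_context": 0.05, "ffn": 0.15, "final_layernorm": 0.40},
    {"query": 0.20, "past_token": 0.30},
]
tags = ["NOUN", "NOUN", "VERB"]
pos_totals = aggregate_by_pos(attributions, tags)
noun_ln_share = source_share(pos_totals, "NOUN", "final_layernorm")
```

In this toy input, `noun_ln_share` is the fraction of noun attribution that falls on Final LayerNorm, which is the kind of feature the pith describes as a hallucination signal. If attribution scores can be negative, the simple ratio used in `source_share` would need a different normalisation.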

If this is right

Hallucination detection works by tracking how much each linguistic category draws from different model components rather than from context.
Nouns that depend mainly on LayerNorm adjustments rather than RAG context reliably flag hallucinations.
The method supplies interpretable signals about which internal component drives factual errors during generation.
Detection improves by incorporating query and token-history effects instead of limiting analysis to a binary knowledge-context conflict.
State-of-the-art results support deployment for increasing factual reliability in RAG systems.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same attribution technique may help locate other generation problems such as inconsistencies that do not involve retrieval.
POS patterns could guide targeted adjustments to specific components like LayerNorm during model development.
Accounting for interactions among the seven sources might produce still sharper detection in follow-on work.

Load-bearing premise

The seven sources fully and additively account for each token probability without large interaction effects or model artifacts that would invalidate the POS anomaly signals.

What would settle it

A collection of RAG responses in which nouns show high LayerNorm attribution yet the content remains factually correct and grounded in the retrieved context, or hallucinated responses that lack any such POS anomalies.

Figures

Figures reproduced from arXiv: 2512.07515 by Anjin Liu, Guangquan Zhang, Jie Lu, Pengqian Lu.

**Figure 2.** Overview of the TPA framework. (1) Coarse-Grained Decomposition: Complete decomposition of token probability into four components (Section 3.2). (2) Fine-Grained Attribution: Mapping attention contributions to four input sources via head-specific weights (Section 3.3). (3) Syntax-Aware Feature Engineering: Aggregating these attributions by POS tags to construct the final detection features (Section 3.3.4).…

**Figure 3.** SHAP summary plots illustrating the decision logic. We visualize the top-10 features for classifiers trained …

**Figure 4.** F1 Score Drop by Removing Components. Results for Llama3-8B and Mistral-7B are detailed in the Appendix due to space constraints. We obtain three observations from this analysis. Observation 1: Fine-grained attribution is necessary. Relying solely on the binary conflict between internal FFN knowledge and external RAG context is insufficient for robust detection. As shown in Figure 3, the classifier frequently …

**Figure 5.** SHAP summary plots illustrating the decision logic. We visualize the top-10 features for classifiers trained …

read the original abstract

Detecting hallucinations in Retrieval-Augmented Generation remains a challenge. Prior approaches attribute hallucinations to a binary conflict between internal knowledge stored in FFNs and the retrieved context. However, this perspective is incomplete, failing to account for the impact of other components of the LLM, such as the user query, previously generated tokens, the self token, and the final LayerNorm adjustment. To comprehensively capture the impact of these components on hallucination detection, we propose TPA which mathematically attributes each token's probability to seven distinct sources: Query, RAG Context, Past Token, Self Token, FFN, Final LayerNorm, and Initial Embedding. This attribution quantifies how each source contributes to the generation of the next token. Specifically, we aggregate these attribution scores by Part-of-Speech (POS) tags to quantify the contribution of each model component to the generation of specific linguistic categories within a response. By leveraging these patterns, such as detecting anomalies where Nouns rely heavily on LayerNorm, TPA effectively identifies hallucinated responses. Extensive experiments show that TPA achieves state-of-the-art performance.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

TPA's seven-source attribution plus POS aggregation for RAG hallucination detection is a concrete extension of binary FFN-context ideas, but the additivity needed for clean anomaly signals looks shaky given transformer non-linearities.

The main thing to know is that this paper proposes TPA, which attributes next-token probabilities in RAG to seven sources and uses part-of-speech anomalies to detect hallucinations. It claims state-of-the-art results by spotting things like nouns depending too much on the final LayerNorm. What is actually new here is the explicit seven-component breakdown that includes the query, past tokens, self-token, initial embeddings, and final LayerNorm on top of the usual FFN and RAG context. The POS aggregation step to turn those scores into a detection method is also a fresh angle compared to the binary conflict approaches cited. The paper does a good job of formalizing the attribution mathematically and showing experimental gains over baselines. It engages with the existing literature on internal knowledge versus context conflicts without ignoring it. Where it gets softer is on the additivity of the decomposition. Given the non-linearities in attention and activations, the seven sources likely have interaction effects that aren't captured, which could make the POS-based anomalies less reliable as hallucination indicators. The stress-test concern about residuals and misallocation seems to land here unless the full paper has strong validation like intervention tests or small residual checks. The choice of POS tags and anomaly thresholds might also introduce some post-hoc fitting, though that's probably minor. Readers in LLM safety and RAG deployment would find this relevant for practical detection improvements. It is the kind of work that deserves a serious referee because the idea is implementable and the claims are concrete enough to check against data. I would recommend sending it for peer review, focusing the referees on whether the attribution holds up under closer scrutiny of the model's non-linear components.

Referee Report

2 major / 1 minor

Summary. The manuscript proposes TPA, which mathematically attributes each next-token probability to seven sources (Query, RAG Context, Past Token, Self Token, FFN, Final LayerNorm, Initial Embedding) and aggregates the scores by POS tags to surface anomalies (e.g., nouns relying heavily on LayerNorm) that indicate hallucinations in RAG outputs, reporting state-of-the-art detection performance.

Significance. If the seven-source decomposition is shown to be additive and the POS anomalies are shown to be reliable signals rather than artifacts, the work would meaningfully extend hallucination detection beyond binary FFN-versus-context views by offering component-level and linguistically grouped interpretability.

major comments (2)

[§3] (Attribution derivation): The central claim requires that next-token probability decomposes completely and additively into the seven listed sources so that POS aggregation can reliably detect anomalies. Transformer non-linearities (attention softmax, FFN activations, LayerNorm scaling) produce interaction effects; the manuscript does not verify that the sum of the seven attributions reconstructs the original probability or quantify any residual. Without this check the POS-based anomaly detection rests on an unverified assumption.
[§4] (Experimental results): The SOTA performance is presented without reported error bars, ablation on POS grouping rules or anomaly thresholds, or statistical tests comparing against baselines. This makes it difficult to assess whether the reported gains are robust or sensitive to post-hoc choices in the anomaly definition.
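
The additivity check requested in the first major comment could look roughly like the sketch below. It assumes the seven attribution scores for each generated token are stored as a list, alongside the model's actual next-token probability. The function names and the numbers are illustrative only and do not come from the manuscript.

```python
def reconstruction_errors(attributions, final_probs):
    """Absolute gap between summed attributions and the model's probability.

    attributions: list of per-token lists, each holding the seven source scores.
    final_probs: list of the model's actual next-token probabilities.
    """
    if len(attributions) != len(final_probs):
        raise ValueError("inputs must be aligned token by token")
    return [abs(sum(scores) - p) for scores, p in zip(attributions, final_probs)]


def mean_absolute_error(errors):
    """Average of the per-token reconstruction gaps."""
    return sum(errors) / len(errors)


# Made-up scores for two tokens: the first reconstructs its probability,
# the second leaves a residual of about 0.02.
toy_attr = [
    [0.10, 0.30, 0.05, 0.05, 0.20, 0.05, 0.05],  # sums to 0.80
    [0.02, 0.10, 0.03, 0.05, 0.20, 0.15, 0.05],  # sums to 0.60
]
toy_probs = [0.80, 0.62]
errors = reconstruction_errors(toy_attr, toy_probs)
mae = mean_absolute_error(errors)
```

A residual near floating-point precision on every token would indicate that the decomposition is complete in practice; a larger residual would have to be reported alongside the POS features.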

minor comments (1)

[Abstract] The seven sources are listed but 'Self Token' is not defined; a one-sentence gloss would improve immediate clarity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful and constructive feedback. We address each major comment below and will revise the manuscript accordingly to strengthen the claims and experimental rigor.

Referee: [§3] (Attribution derivation): The central claim requires that next-token probability decomposes completely and additively into the seven listed sources so that POS aggregation can reliably detect anomalies. Transformer non-linearities (attention softmax, FFN activations, LayerNorm scaling) produce interaction effects; the manuscript does not verify that the sum of the seven attributions reconstructs the original probability or quantify any residual. Without this check the POS-based anomaly detection rests on an unverified assumption.

Authors: We agree that an explicit verification of additivity is necessary to support the reliability of POS-based anomaly detection. Our derivation isolates component contributions via targeted logit decomposition, but we acknowledge that non-linearities can produce small interaction residuals not captured in the original submission. In the revision we will add a verification analysis (new subsection in §3 or appendix) that sums the seven attributions and reports the mean absolute reconstruction error relative to the original next-token probability across the evaluation sets, along with any adjustments to the anomaly scoring if residuals prove non-negligible. revision: yes

Referee: [§4] (Experimental results): The SOTA performance is presented without reported error bars, ablation on POS grouping rules or anomaly thresholds, or statistical tests comparing against baselines. This makes it difficult to assess whether the reported gains are robust or sensitive to post-hoc choices in the anomaly definition.

Authors: We concur that the absence of error bars, ablations, and statistical tests limits assessment of robustness. The original results used a single fixed threshold and standard POS tagging without reporting variability. In the revised §4 we will report mean and standard deviation over five independent runs, include ablations on anomaly thresholds (e.g., 10–30% LayerNorm reliance) and POS grouping variants (e.g., coarse vs. fine tags), and add statistical comparisons (bootstrap confidence intervals and McNemar’s test) against baselines to demonstrate that performance gains are stable rather than sensitive to post-hoc choices. revision: yes
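
The two statistical comparisons named in this response can be sketched with the standard library alone. The sketch assumes binary hallucination labels and two detectors' binary predictions on the same examples; it uses accuracy rather than F1 for brevity, and all names and data are hypothetical stand-ins, not results from the paper.

```python
import random
from math import comb


def bootstrap_accuracy_gap(y_true, pred_a, pred_b, n_boot=2000, seed=0):
    """Percentile bootstrap 95% interval for accuracy(A) - accuracy(B)."""
    rng = random.Random(seed)
    n = len(y_true)
    gaps = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]  # resample examples with replacement
        acc_a = sum(pred_a[i] == y_true[i] for i in idx) / n
        acc_b = sum(pred_b[i] == y_true[i] for i in idx) / n
        gaps.append(acc_a - acc_b)
    gaps.sort()
    return gaps[int(0.025 * n_boot)], gaps[int(0.975 * n_boot) - 1]


def mcnemar_exact(y_true, pred_a, pred_b):
    """Two-sided exact McNemar test on the discordant pairs."""
    only_a = sum(a == t and b != t for t, a, b in zip(y_true, pred_a, pred_b))
    only_b = sum(a != t and b == t for t, a, b in zip(y_true, pred_a, pred_b))
    n = only_a + only_b
    if n == 0:
        return 1.0
    tail = sum(comb(n, k) for k in range(min(only_a, only_b) + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Made-up labels: detector A errs once, detector B errs on eight other examples.
y_true = [1] * 10 + [0] * 10
pred_a = list(y_true)
pred_a[0] = 0
pred_b = list(y_true)
for i in range(1, 9):
    pred_b[i] = 1 - pred_b[i]

ci_low, ci_high = bootstrap_accuracy_gap(y_true, pred_a, pred_b)
p_value = mcnemar_exact(y_true, pred_a, pred_b)
```

McNemar's test looks only at examples where the two detectors disagree, which suits paired comparisons on a shared test set; the bootstrap interval gives a sense of how much the gap moves under resampling.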

Circularity Check

0 steps flagged

No significant circularity detected in derivation chain

The paper derives TPA via a proposed mathematical attribution of next-token probabilities to seven explicitly enumerated model components (Query, RAG Context, Past Token, Self Token, FFN, Final LayerNorm, Initial Embedding), followed by POS-tag aggregation to surface anomalies. This decomposition is presented as an internal mechanistic breakdown rather than a fit to hallucination labels or a self-referential definition. No equations reduce the target anomaly detection to the inputs by construction, no load-bearing self-citations appear, and the SOTA claims rest on external benchmark experiments. The method is therefore self-contained against independent evaluation data.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The method rests on the assumption that token probability can be cleanly decomposed into the listed seven additive sources without residual interactions. No free parameters are explicitly named in the abstract, though anomaly thresholds for POS patterns are likely tuned. No new physical entities are introduced.

axioms (1)

domain assumption: Next-token probability can be attributed additively to Query, RAG Context, Past Token, Self Token, FFN, Final LayerNorm, and Initial Embedding.
Stated in the abstract as the basis for TPA.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

Relation between the paper passage and the cited Recognition theorem: unclear.

We define ... $\Delta P_{\text{initial}}(y) = \Phi(h^{(0)}, y)$ ... $P_{\text{final}}(y) = \Delta P_{\text{initial}}(y) + \Delta P_{\text{LN}} + \sum_{l=1}^{L}\left(\Delta P^{(l)}_{\text{att}} + \Delta P^{(l)}_{\text{ffn}}\right)$ (Theorem 1)

IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · absolute_floor_iff_bare_distinguishability · unclear

Relation between the paper passage and the cited Recognition theorem: unclear.

the attribution scores of these seven parts sum to the token’s final probability
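
One way the sum in Theorem 1 can hold exactly is by telescoping, and the sketch below illustrates that reading. It assumes each Δ term is the change in the probe output Φ(h, y) across one update to the hidden state; the excerpts quoted on this page suggest this but do not spell it out. The toy "layers" are random residual updates rather than a real transformer, and every name and dimension is hypothetical.

```python
import math
import random


def unembed(h, W_U):
    """Compute logits[v] = sum_i h[i] * W_U[i][v] for a d x |V| matrix W_U."""
    return [sum(h[i] * W_U[i][v] for i in range(len(h))) for v in range(len(W_U[0]))]


def probe(h, W_U, y):
    """Phi(h, y): softmax probability of token y when h is unembedded directly."""
    logits = unembed(h, W_U)
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    return exps[y] / sum(exps)


def layer_norm(h, eps=1e-5):
    """Plain LayerNorm without learned scale or bias."""
    mean = sum(h) / len(h)
    var = sum((x - mean) ** 2 for x in h) / len(h)
    return [(x - mean) / math.sqrt(var + eps) for x in h]


rng = random.Random(0)
d, vocab, n_layers, y = 8, 20, 3, 7
W_U = [[rng.gauss(0, 1) for _ in range(vocab)] for _ in range(d)]

h = [rng.gauss(0, 1) for _ in range(d)]           # stand-in for the initial embedding
terms = {"initial": probe(h, W_U, y)}
for layer in range(1, n_layers + 1):
    before = probe(h, W_U, y)
    h = [x + rng.gauss(0, 0.5) for x in h]        # stand-in attention residual update
    mid = probe(h, W_U, y)
    h = [x + rng.gauss(0, 0.5) for x in h]        # stand-in FFN residual update
    after = probe(h, W_U, y)
    terms[f"att_{layer}"] = mid - before
    terms[f"ffn_{layer}"] = after - mid
p_final = probe(layer_norm(h), W_U, y)
terms["ln"] = p_final - probe(h, W_U, y)

residual = abs(sum(terms.values()) - p_final)
```

Under this assumption the intermediate probe values cancel pairwise, so `residual` is at floating-point level even though softmax and LayerNorm are non-linear. Whether each individual term is a faithful measure of its component's influence is a separate question that the telescoping identity does not answer.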

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

28 extracted references · 28 canonical work pages · 1 internal anchor

[1] RefChecker: Reference-based Fine-grained Hallucination Checker and Benchmark for Large Language Models, May 2024.

Looking for a needle in a haystack: A comprehensive study of hallucinations in neural machine translation. In Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics, pages 1059–1075. Jiatong Han, Jannik Kossen, Muhammed Razzak, Lisa Schut, Shreshth A Malik, and Yarin Gal. 2024. Semantic entropy pro...

[2] Ragtruth: A hallucination corpus for developing trustworthy retrieval-augmented language models. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 10862–10878. nostalgebraist. 2020. interpreting gpt: the logit lens. LessWrong. Seongheon Park, Xuefeng Du, Min-Hsuan Yeh, Haobo Wang, an...

[3] Redeep: Detecting hallucination in retrieval-augmented generation via mechanistic interpretability. In ICLR. Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, and 1 others. 2023. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint...

[4] EigenScore/INSIDE (Chen et al., 2024). Focuses on detecting hallucination by evaluating the response’s semantic consistency, defined as the log-determinant of the covariance matrix of the LLM’s internal states while generating the response.

[5] SEP (Han et al., 2024). Proposes a linear model to detect hallucination based on semantic entropy at test time without requiring multiple responses.

[6] SAPLMA (Azaria and Mitchell, 2023). Detects hallucination based on the hidden-layer activations of LLMs.

[7] ITI (Li et al., 2023). Detects hallucination based on the hidden-layer activations of LLMs.

[8] Ragtruth Prompt (Niu et al., 2024). Provides prompts for an LLM-as-judge to detect hallucination in the RAG setting.

[9] ChainPoll (Friel and Sanyal, 2023). Provides prompts for an LLM-as-judge to detect hallucination in the RAG setting.

[10] RAGAS (Es et al., 2024). Uses an LLM to split the response into a set of statements and verify that each statement is supported by the retrieved documents. If any statement is not supported, the response is considered hallucinated.

[11] Trulens (TrueLens, 2024). Uses an LLM to evaluate the overlap between the retrieved documents and the generated response in order to detect hallucination.

[12] P(True) (Kadavath et al., 2022). Detects hallucinations by having the model estimate the probability that its own generated answer is correct, based on the key assumption that it is often easier for a model to recognize a correct answer than to generate one.

[13] SelfCheckGPT (Manakul et al., 2023). Detects hallucinations by checking for informational consistency across multiple stochastically sampled responses, based on the assumption that factual knowledge leads to consistent statements while hallucinations lead to divergent and contradictory ones.

[14] LN-Entropy (Malinin and Gales, 2021). Detects hallucinations by quantifying knowledge uncertainty, which it measures primarily with a novel metric called Reverse Mutual Information that captures the disagreement across an ensemble’s predictions, with high RMI indicating a likely hallucination.

[15] Energy (Liu et al., 2020). Detects hallucinations by using an energy score, derived directly from the model’s logits, as a more reliable uncertainty measure than softmax confidence to identify out-of-distribution inputs that cause the model to hallucinate.

[16] Focus (Zhang et al., 2023). Detects hallucinations by calculating an uncertainty score focused on keywords, then refines it by propagating penalties from unreliable context via attention and correcting token probabilities using entity types and inverse document frequency to mitigate both overconfidence and underconfidence.

[17] Perplexity (Ren et al., 2023). Detects hallucinations by separately measuring the Relative Mahalanobis Distance for both input and output embeddings, based on the assumption that in-domain examples will have embeddings closer to their respective foreground (in-domain) distributions than to a generic background distribution.

[18] REFCHECKER (Hu et al., 2024). Uses an LLM to extract claim-triplets from a response and verifies them with another LLM to detect hallucination.

[19] REDEEP (Sun et al., 2025). Detects hallucination by analyzing the balance between the contributions from Copying Heads that process external context and Knowledge FFNs that inject internal knowledge, based on the finding that RAG hallucinations often arise from conflicts between these two sources. This method has two versions: token level and chunk le...

[20] NoVo (Ho et al., 2025). Leverages the L2 norms of specific attention heads as reliable indicators of truthfulness. By identifying a subset of truth-correlated heads from a small reference set, it employs a voting mechanism based on these head norms to detect hallucinations without requiring model parameter updates.

[21] TSV (Park et al., 2025). Introduces a lightweight steering vector to reshape the LLM’s latent space during inference. By actively intervening to enhance the linear separability between truthful and hallucinated representations in the hidden states, it enables effective detection using a simple classifier on the steered embeddings. 13 Complexity Ana...

[22] Complete Probability Decomposition. To satisfy Theorem 1, we must compute the complete probability changes using the probe function $\Phi(h, y)$. The bottleneck is the calculation of the global partition function (denominator) in Softmax. • Mechanism: The probe function $\Phi(h, y) = \mathrm{Softmax}(h W_U)_y = \frac{\exp(w_{U,y}^{\top} h)}{\sum_{v \in V} \exp(w_{U,v}^{\top} h)}$ requires projecting the hidden s...

[23] Global Components: For $\Delta P_{\text{initial}}$ and $\Delta P_{\text{LN}}$, the probe is called once per generation step. Cost: $O(T \cdot |V| \cdot d)$.

[24] Layer Components: For $\Delta P^{(l)}_{\text{att}}$ and $\Delta P^{(l)}_{\text{ffn}}$, the probe is invoked twice per layer (before and after the residual update). Summing over $L$ layers, this costs $O(L \cdot T \cdot |V| \cdot d)$. • Stage Complexity: Combining these terms, the dominant complexity is $O(L \cdot T \cdot |V| \cdot d)$.

[25] Head-wise Attribution. Once $\Delta P^{(l)}_{\text{att}}$ is obtained, we apportion it to individual heads based on their contribution to the target logit. • Mechanism: This attribution requires projecting the target token vector $w_{U,y}$ back into the hidden state space using the layer’s output projection matrix $W_O^{(l)} \in \mathbb{R}^{d \times d}$. • Step Complexity: The calculation proceeds in tw...

[26] Projection: We compute the projected target vector $g = (W_O^{(l)})^{\top} w_{U,y}$. Since $W_O^{(l)}$ is a $d \times d$ matrix, this matrix-vector multiplication costs $O(d^2)$.

[27] Assignment: We distribute the contribution to $H$ heads by performing dot products between the head outputs $o_h$ and the corresponding segments of $g$. For $H$ heads, this sums to $O(d)$. • Stage Complexity: The projection step ($O(d^2)$) dominates the assignment step ($O(d)$). Integrating over $L$ layers and $T$ tokens, the total complexity is $O(L \cdot T \cdot d^2)$.
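
The projection and assignment steps quoted above can be checked on toy data. The sketch assumes a column-vector convention in which the attention block's output is W_O applied to the concatenated head outputs, so that each head's contribution to the target logit is a dot product with its segment of g. Matrix sizes, variable names, and the random values are hypothetical, and the sketch stops at logit contributions because the excerpt does not show how they are converted into shares of the probability change.

```python
import random


def matvec(M, x):
    """Multiply matrix M (list of rows) by vector x."""
    return [sum(row[j] * x[j] for j in range(len(x))) for row in M]


def transpose(M):
    return [list(col) for col in zip(*M)]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


rng = random.Random(0)
d, n_heads = 8, 2
d_head = d // n_heads

W_O = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(d)]   # output projection
w_target = [rng.gauss(0, 1) for _ in range(d)]                  # unembedding vector w_{U,y}
head_outputs = [[rng.gauss(0, 1) for _ in range(d_head)] for _ in range(n_heads)]

# Projection step: g = W_O^T w_{U,y}, an O(d^2) matrix-vector product.
g = matvec(transpose(W_O), w_target)

# Assignment step: one dot product per head against its segment of g.
head_logit = [
    dot(o_h, g[i * d_head:(i + 1) * d_head])
    for i, o_h in enumerate(head_outputs)
]

# Cross-check against the direct route: concatenate heads, apply W_O, then unembed.
concat = [x for o_h in head_outputs for x in o_h]
direct_logit = dot(w_target, matvec(W_O, concat))
gap = abs(sum(head_logit) - direct_logit)
```

Because this part of the computation is linear, the per-head logit contributions add up to the direct value, and `gap` is at floating-point level. Computing g once and reusing it for all heads is what keeps the assignment step at O(d) after the O(d²) projection.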

[28] Mapping Attention to Input Sources. Finally, we map head contributions to the four sources by aggregating attention weights $A \in \mathbb{R}^{H \times |s| \times |s|}$. This involves two distinct sub-steps for each generated token at step $t$ within a single layer: • Step 1: Summation. For each head $h$, we sum the attention weights corresponding to specific source indices (e.g., $I_{\text{RAG}}$): w...
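
The summation step can be illustrated with a small sketch for a single head at a single generation step. The index sets, the attention row, and the head contribution value are all made up. The second sub-step is truncated in the excerpt above, so the proportional split at the end is an assumption for illustration, not the paper's stated rule.

```python
def source_weights(attn_row, source_index_sets):
    """Sum one head's attention weights over each source's token positions."""
    return {name: sum(attn_row[i] for i in idx)
            for name, idx in source_index_sets.items()}


# Hypothetical layout: positions 0-1 are the query, 2-4 the retrieved context,
# 5 a previously generated token, and 6 the current (self) token.
index_sets = {
    "query": [0, 1],
    "rag_context": [2, 3, 4],
    "past_token": [5],
    "self_token": [6],
}
attn_row = [0.05, 0.05, 0.30, 0.20, 0.10, 0.10, 0.20]   # one head, one step, sums to 1
head_contribution = 0.08                                  # made-up share of the attention term

weights = source_weights(attn_row, index_sets)
# Assumption: the head's contribution is split in proportion to the summed weights.
per_source = {name: head_contribution * w for name, w in weights.items()}
```

Because the four index sets partition the attended positions and the attention row sums to one, the per-source values add back up to the head's contribution under this assumed split.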
