TPA: Next Token Probability Attribution for Detecting Hallucinations in RAG
Pith reviewed 2026-05-17 00:59 UTC · model grok-4.3
The pith
Attributing each next-token probability to seven model sources and grouping by part-of-speech tags detects hallucinations in RAG outputs.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
TPA mathematically attributes each token's probability to seven distinct sources: Query, RAG Context, Past Token, Self Token, FFN, Final LayerNorm, and Initial Embedding. Aggregating the attribution scores by Part-of-Speech tags reveals component-specific anomalies, such as nouns relying excessively on Final LayerNorm, that reliably indicate hallucinated responses in RAG.
What carries the argument
The seven-source decomposition of next-token probability, aggregated by POS tags to surface anomalies like noun over-reliance on LayerNorm.
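The aggregation step can be sketched as follows. This is a minimal illustration, assuming per-token attribution scores and POS tags are already available; the function name `aggregate_by_pos`, the choice of mean aggregation, and the toy numbers are assumptions made here, not the paper's implementation or data.

```python
from collections import defaultdict

SOURCES = ["Query", "RAG Context", "Past Token", "Self Token",
           "FFN", "Final LayerNorm", "Initial Embedding"]

def aggregate_by_pos(token_attributions, pos_tags):
    """Average each source's attribution over tokens that share a POS tag.

    token_attributions: list of dicts (source name -> score), one per token.
    pos_tags: list of POS tag strings aligned with token_attributions.
    Mean aggregation is an assumption of this sketch.
    """
    if len(token_attributions) != len(pos_tags):
        raise ValueError("attributions and tags must be aligned")
    sums = defaultdict(lambda: dict.fromkeys(SOURCES, 0.0))
    counts = defaultdict(int)
    for scores, tag in zip(token_attributions, pos_tags):
        counts[tag] += 1
        for source in SOURCES:
            sums[tag][source] += scores.get(source, 0.0)
    return {tag: {s: sums[tag][s] / counts[tag] for s in SOURCES} for tag in sums}

# Toy values, invented purely for illustration.
toy_attr = [
    {"RAG Context": 0.50, "Final LayerNorm": 0.10, "FFN": 0.20},  # a noun
    {"RAG Context": 0.10, "Final LayerNorm": 0.50, "FFN": 0.10},  # a noun
    {"Query": 0.30, "Past Token": 0.40},                          # a verb
]
toy_tags = ["NOUN", "NOUN", "VERB"]
features = aggregate_by_pos(toy_attr, toy_tags)
print(features["NOUN"]["Final LayerNorm"])
```

The resulting per-tag, per-source table is the kind of feature vector on which an anomaly such as high noun reliance on Final LayerNorm could be checked.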
If this is right
- Hallucination detection works by tracking how much each linguistic category draws from different model components rather than from context.
- Nouns that depend mainly on LayerNorm adjustments rather than RAG context reliably flag hallucinations.
- The method supplies interpretable signals about which internal component drives factual errors during generation.
- Detection improves by incorporating query and token-history effects instead of limiting analysis to a binary knowledge-context conflict.
- State-of-the-art results support deployment for increasing factual reliability in RAG systems.
Where Pith is reading between the lines
- The same attribution technique may help locate other generation problems such as inconsistencies that do not involve retrieval.
- POS patterns could guide targeted adjustments to specific components like LayerNorm during model development.
- Accounting for interactions among the seven sources might produce still sharper detection in follow-on work.
Load-bearing premise
The seven sources fully and additively account for each token probability without large interaction effects or model artifacts that would invalidate the POS anomaly signals.
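Written as an equation, the premise is the additivity identity that this page quotes from the paper's Theorem 1 in its theorem-link section. Reading the attention term as the part that is later split across the four input sources (Query, RAG Context, Past Token, Self Token) follows the paper's description; the residual symbol $\varepsilon$ is notation introduced here only to name what a verification would measure.

```latex
P_{\text{final}}(y)
  = \Delta P_{\text{initial}}(y)
  + \Delta P_{\text{LN}}
  + \sum_{l} \left( \Delta P^{(l)}_{\text{att}} + \Delta P^{(l)}_{\text{ffn}} \right),
\qquad
\varepsilon(y) = P_{\text{final}}(y) - \sum_{k=1}^{7} a_k(y)
```

Here $a_k(y)$ stands for the attribution assigned to the $k$-th of the seven sources. The premise holds exactly when $\varepsilon(y) = 0$ for every generated token.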
What would settle it
A collection of RAG responses in which nouns show high LayerNorm attribution yet the content remains factually correct and grounded in the retrieved context, or hallucinated responses that lack any such POS anomalies.
Original abstract
Detecting hallucinations in Retrieval-Augmented Generation remains a challenge. Prior approaches attribute hallucinations to a binary conflict between internal knowledge stored in FFNs and the retrieved context. However, this perspective is incomplete, failing to account for the impact of other components of the LLM, such as the user query, previously generated tokens, the self token, and the final LayerNorm adjustment. To comprehensively capture the impact of these components on hallucination detection, we propose TPA which mathematically attributes each token's probability to seven distinct sources: Query, RAG Context, Past Token, Self Token, FFN, Final LayerNorm, and Initial Embedding. This attribution quantifies how each source contributes to the generation of the next token. Specifically, we aggregate these attribution scores by Part-of-Speech (POS) tags to quantify the contribution of each model component to the generation of specific linguistic categories within a response. By leveraging these patterns, such as detecting anomalies where Nouns rely heavily on LayerNorm, TPA effectively identifies hallucinated responses. Extensive experiments show that TPA achieves state-of-the-art performance.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes TPA, which mathematically attributes each next-token probability to seven sources (Query, RAG Context, Past Token, Self Token, FFN, Final LayerNorm, Initial Embedding) and aggregates the scores by POS tags to surface anomalies (e.g., nouns relying heavily on LayerNorm) that indicate hallucinations in RAG outputs, reporting state-of-the-art detection performance.
Significance. If the seven-source decomposition is shown to be additive and the POS anomalies are shown to be reliable signals rather than artifacts, the work would meaningfully extend hallucination detection beyond binary FFN-versus-context views by offering component-level and linguistically grouped interpretability.
major comments (2)
- [§3] §3 (Attribution derivation): the central claim requires that next-token probability decomposes completely and additively into the seven listed sources so that POS aggregation can reliably detect anomalies. Transformer non-linearities (attention softmax, FFN activations, LayerNorm scaling) produce interaction effects; the manuscript does not verify that the sum of the seven attributions reconstructs the original probability or quantify any residual. Without this check the POS-based anomaly detection rests on an unverified assumption.
- [§4] §4 (Experimental results): the SOTA performance is presented without reported error bars, ablation on POS grouping rules or anomaly thresholds, or statistical tests comparing against baselines. This makes it difficult to assess whether the reported gains are robust or sensitive to post-hoc choices in the anomaly definition.
minor comments (1)
- [Abstract] Abstract: the seven sources are listed but 'Self Token' is not defined; a one-sentence gloss would improve immediate clarity.
Simulated Author's Rebuttal
We thank the referee for the thoughtful and constructive feedback. We address each major comment below and will revise the manuscript accordingly to strengthen the claims and experimental rigor.
Point-by-point responses
-
Referee: [§3] §3 (Attribution derivation): the central claim requires that next-token probability decomposes completely and additively into the seven listed sources so that POS aggregation can reliably detect anomalies. Transformer non-linearities (attention softmax, FFN activations, LayerNorm scaling) produce interaction effects; the manuscript does not verify that the sum of the seven attributions reconstructs the original probability or quantify any residual. Without this check the POS-based anomaly detection rests on an unverified assumption.
Authors: We agree that an explicit verification of additivity is necessary to support the reliability of POS-based anomaly detection. Our derivation isolates component contributions via targeted logit decomposition, but we acknowledge that non-linearities can produce small interaction residuals not captured in the original submission. In the revision we will add a verification analysis (new subsection in §3 or appendix) that sums the seven attributions and reports the mean absolute reconstruction error relative to the original next-token probability across the evaluation sets, along with any adjustments to the anomaly scoring if residuals prove non-negligible. revision: yes
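The verification the authors describe could look roughly like the sketch below. It assumes the seven attributions are available as one row per token; the function name `reconstruction_error` and the toy numbers are hypothetical and are not taken from the manuscript.

```python
def reconstruction_error(attributions, final_probs):
    """Mean absolute gap between summed attributions and the model's probability.

    attributions: list of rows, each holding the seven per-source scores for a token.
    final_probs: list of the model's next-token probabilities, one per token.
    """
    if len(attributions) != len(final_probs) or not final_probs:
        raise ValueError("need one non-empty, aligned row per token")
    residuals = [p - sum(row) for row, p in zip(attributions, final_probs)]
    return sum(abs(r) for r in residuals) / len(residuals)

# Toy values, invented for illustration only.
toy_rows = [
    [0.10, 0.40, 0.10, 0.05, 0.15, 0.05, 0.05],  # sums to 0.90
    [0.05, 0.10, 0.05, 0.05, 0.10, 0.10, 0.05],  # sums to 0.50
]
toy_probs = [0.90, 0.52]
mae = reconstruction_error(toy_rows, toy_probs)
print(mae)
```

A value near zero would support the additivity assumption; a larger value would quantify the interaction residual the referee asks about.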
-
Referee: [§4] §4 (Experimental results): the SOTA performance is presented without reported error bars, ablation on POS grouping rules or anomaly thresholds, or statistical tests comparing against baselines. This makes it difficult to assess whether the reported gains are robust or sensitive to post-hoc choices in the anomaly definition.
Authors: We concur that the absence of error bars, ablations, and statistical tests limits assessment of robustness. The original results used a single fixed threshold and standard POS tagging without reporting variability. In the revised §4 we will report mean and standard deviation over five independent runs, include ablations on anomaly thresholds (e.g., 10–30 % LayerNorm reliance) and POS grouping variants (e.g., coarse vs. fine tags), and add statistical comparisons (bootstrap confidence intervals and McNemar’s test) against baselines to demonstrate that performance gains are stable rather than sensitive to post-hoc choices. revision: yes
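As one example of the promised statistical comparison, an exact McNemar test on paired detector decisions can be sketched as below. The function name `mcnemar_exact` and the toy labels are assumptions for illustration; the two-sided exact binomial form is a standard variant of the test, not necessarily the one the authors will use.

```python
from math import comb

def mcnemar_exact(y_true, pred_a, pred_b):
    """Exact two-sided McNemar test on paired binary predictions.

    Returns (b, c, p_value), where b counts items only detector A gets right
    and c counts items only detector B gets right.
    """
    b = sum(1 for t, a, o in zip(y_true, pred_a, pred_b) if a == t and o != t)
    c = sum(1 for t, a, o in zip(y_true, pred_a, pred_b) if a != t and o == t)
    n = b + c
    if n == 0:
        return b, c, 1.0  # no discordant pairs, so no evidence of a difference
    # Under the null, discordant pairs split like a fair coin.
    tail = sum(comb(n, k) for k in range(min(b, c) + 1)) / 2 ** n
    return b, c, min(1.0, 2 * tail)

# Toy labels, invented for illustration only (1 = hallucinated).
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
pred_a = [1, 0, 1, 1, 0, 0, 1, 0]
pred_b = [1, 0, 0, 1, 1, 0, 0, 0]
b, c, p_value = mcnemar_exact(y_true, pred_a, pred_b)
print(b, c, p_value)
```

With so few discordant pairs the test cannot reject the null even when one detector is never wrong, which is why the sample size of the evaluation set matters for this comparison.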
Circularity Check
No significant circularity detected in derivation chain
full rationale
The paper derives TPA via a proposed mathematical attribution of next-token probabilities to seven explicitly enumerated model components (Query, RAG Context, Past Token, Self Token, FFN, Final LayerNorm, Initial Embedding), followed by POS-tag aggregation to surface anomalies. This decomposition is presented as an internal mechanistic breakdown rather than a fit to hallucination labels or a self-referential definition. No equations reduce the target anomaly detection to the inputs by construction, no load-bearing self-citations appear, and the SOTA claims rest on external benchmark experiments. The method is therefore self-contained against independent evaluation data.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Next-token probability can be attributed additively to Query, RAG Context, Past Token, Self Token, FFN, Final LayerNorm, and Initial Embedding
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.lean: washburn_uniqueness_aczel (unclear)
unclear: Relation between the paper passage and the cited Recognition theorem.
We define ... $\Delta P_{\text{initial}}(y) = \Phi(h^{(0)}, y)$ ... $P_{\text{final}}(y) = \Delta P_{\text{initial}}(y) + \Delta P_{\text{LN}} + \sum_{l} (\Delta P^{(l)}_{\text{att}} + \Delta P^{(l)}_{\text{ffn}})$ (Theorem 1)
-
IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean: absolute_floor_iff_bare_distinguishability (unclear)
unclear: Relation between the paper passage and the cited Recognition theorem.
the attribution scores of these seven parts sum to the token’s final probability
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
-
[1]
Looking for a needle in a haystack: A comprehensive study of hallucinations in neural machine translation. In Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics, pages 1059–1075. Jiatong Han, Jannik Kossen, Muhammed Razzak, Lisa Schut, Shreshth A Malik, and Yarin Gal. 2024. Semantic entropy pro...
-
[2]
Ragtruth: A hallucination corpus for developing trustworthy retrieval-augmented language models. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 10862–10878. nostalgebraist. 2020. interpreting gpt: the logit lens. LessWrong. Seongheon Park, Xuefeng Du, Min-Hsuan Yeh, Haobo Wang, an...
-
[3]
Redeep: Detecting hallucination in retrieval-augmented generation via mechanistic interpretability. In ICLR. Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, and 1 others. 2023. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint...
-
[4]
EigenScore/INSIDE (Chen et al., 2024): Detects hallucination by evaluating the response's semantic consistency, defined as the log-determinant of the covariance matrix of the LLM's internal states while generating the response.
-
[5]
SEP (Han et al., 2024): Proposes a linear model that detects hallucination from semantic entropy at test time without requiring multiple responses.
-
[6]
SAPLMA (Azaria and Mitchell, 2023): Detects hallucination based on the hidden-layer activations of LLMs.
-
[7]
ITI (Li et al., 2023): Detects hallucination based on the hidden-layer activations of LLMs.
-
[8]
Ragtruth Prompt (Niu et al., 2024): Provides prompts for an LLM-as-judge to detect hallucination in the RAG setting.
-
[9]
ChainPoll (Friel and Sanyal, 2023): Provides prompts for an LLM-as-judge to detect hallucination in the RAG setting.
-
[10]
RAGAS (Es et al., 2024): Uses an LLM to split the response into a set of statements and verify that each statement is supported by the retrieved documents. If any statement is not supported, the response is considered hallucinated.
-
[11]
Trulens (TrueLens, 2024): Uses an LLM to evaluate the overlap between the retrieved documents and the generated response in order to detect hallucination.
-
[12]
P(True) (Kadavath et al., 2022): Detects hallucinations by having the model estimate the probability that its own generated answer is correct, based on the key assumption that it is often easier for a model to recognize a correct answer than to generate one.
-
[13]
SelfCheckGPT (Manakul et al., 2023): Detects hallucinations by checking for informational consistency across multiple stochastically sampled responses, based on the assumption that factual knowledge leads to consistent statements while hallucinations lead to divergent and contradictory ones.
-
[14]
LN-Entropy (Malinin and Gales, 2021): Detects hallucinations by quantifying knowledge uncertainty, which it measures primarily with a novel metric called Reverse Mutual Information (RMI) that captures the disagreement across an ensemble's predictions, with high RMI indicating a likely hallucination.
-
[15]
Energy (Liu et al., 2020): Detects hallucinations by using an energy score, derived directly from the model's logits, as a more reliable uncertainty measure than softmax confidence to identify out-of-distribution inputs that cause the model to hallucinate.
-
[16]
Focus (Zhang et al., 2023): Detects hallucinations by calculating an uncertainty score focused on keywords, then refines it by propagating penalties from unreliable context via attention and correcting token probabilities using entity types and inverse document frequency to mitigate both overconfidence and underconfidence.
-
[17]
Perplexity (Ren et al., 2023): Detects hallucinations by separately measuring the Relative Mahalanobis Distance for both input and output embeddings, based on the assumption that in-domain examples will have embeddings closer to their respective foreground (in-domain) distributions than to a generic background distribution.
-
[18]
REFCHECKER (Hu et al., 2024): Uses an LLM to extract claim triplets from a response and a second LLM to verify them in order to detect hallucination.
-
[19]
REDEEP (Sun et al., 2025): Detects hallucination by analyzing the balance between the contributions from Copying Heads that process external context and Knowledge FFNs that inject internal knowledge, based on the finding that RAG hallucinations often arise from conflicts between these two sources. This method has two versions: token level and chunk level.
-
[20]
NoVo (Ho et al., 2025): Leverages the L2 norms of specific attention heads as reliable indicators of truthfulness. By identifying a subset of truth-correlated heads from a small reference set, it employs a voting mechanism based on these head norms to detect hallucinations without requiring model parameter updates.
-
[21]
TSV (Park et al., 2025): Introduces a lightweight steering vector to reshape the LLM's latent space during inference. By actively intervening to enhance the linear separability between truthful and hallucinated representations in the hidden states, it enables effective detection using a simple classifier on the steered embeddings. 13 Complexity Ana...
-
[22]
Complete Probability Decomposition. To satisfy Theorem 1, we must compute the complete probability changes using the probe function $\Phi(h, y)$. The bottleneck is the calculation of the global partition function (denominator) in Softmax.
• Mechanism: The probe function $\Phi(h, y) = \mathrm{Softmax}(h W_U)_y = \frac{\exp(w_{U,y}^\top h)}{\sum_{v \in V} \exp(w_{U,v}^\top h)}$ requires projecting the hidden s...
-
[23]
Global Components: For $\Delta P_{\text{initial}}$ and $\Delta P_{\text{LN}}$, the probe is called once per generation step. Cost: $O(T \cdot |V| \cdot d)$.
-
[24]
Layer Components: For $\Delta P^{(l)}_{\text{att}}$ and $\Delta P^{(l)}_{\text{ffn}}$, the probe is invoked twice per layer (before and after the residual update). Summing over $L$ layers, this costs $O(L \cdot T \cdot |V| \cdot d)$.
• Stage Complexity: Combining these terms, the dominant complexity is $O(L \cdot T \cdot |V| \cdot d)$.
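The probe-based decomposition in this stage can be sketched as a telescoping sum over a toy residual stream. This is an illustration under stated assumptions: the helper names (`probe`, `layer_norm`, `telescoping_attribution`), the parameter-free LayerNorm, the definition of the LayerNorm term as the probability change across the final normalization, and all toy vectors are invented here and are not taken from the paper.

```python
import math

def probe(h, W_U, y):
    """Phi(h, y): softmax probability of token y after unembedding h with W_U (d x |V|)."""
    vocab = len(W_U[0])
    logits = [sum(h[i] * W_U[i][v] for i in range(len(h))) for v in range(vocab)]
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    return exps[y] / sum(exps)

def layer_norm(h, eps=1e-5):
    """Parameter-free LayerNorm, a simplification for this sketch."""
    mean = sum(h) / len(h)
    var = sum((x - mean) ** 2 for x in h) / len(h)
    return [(x - mean) / math.sqrt(var + eps) for x in h]

def telescoping_attribution(h0, layer_updates, W_U, y):
    """Probe before and after every residual update and record the probability changes."""
    prev = probe(h0, W_U, y)
    deltas = {"initial": prev, "att": [], "ffn": []}
    h = list(h0)
    for att_update, ffn_update in layer_updates:
        h = [a + b for a, b in zip(h, att_update)]
        cur = probe(h, W_U, y)
        deltas["att"].append(cur - prev)
        prev = cur
        h = [a + b for a, b in zip(h, ffn_update)]
        cur = probe(h, W_U, y)
        deltas["ffn"].append(cur - prev)
        prev = cur
    p_final = probe(layer_norm(h), W_U, y)
    deltas["ln"] = p_final - prev
    return deltas, p_final

# Toy model: d = 3, |V| = 4, two layers. All numbers are invented.
W_U = [[0.2, -0.1, 0.4, 0.0], [0.5, 0.3, -0.2, 0.1], [-0.3, 0.6, 0.1, 0.2]]
h0 = [0.5, -1.0, 2.0]
updates = [([0.1, 0.2, -0.3], [0.0, 0.4, 0.1]), ([-0.2, 0.1, 0.3], [0.3, -0.1, 0.0])]
deltas, p_final = telescoping_attribution(h0, updates, W_U, y=1)
total = deltas["initial"] + deltas["ln"] + sum(deltas["att"]) + sum(deltas["ffn"])
print(total, p_final)
```

Because each term is a difference of consecutive probe values, the terms sum to the final probability by construction. Each `probe` call touches every vocabulary entry, which is where the $|V| \cdot d$ factor in the stage complexity comes from.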
-
[25]
Head-wise Attribution. Once $\Delta P^{(l)}_{\text{att}}$ is obtained, we apportion it to individual heads based on their contribution to the target logit.
• Mechanism: This attribution requires projecting the target token vector $w_{U,y}$ back into the hidden state space using the layer's output projection matrix $W^{(l)}_O \in \mathbb{R}^{d \times d}$.
• Step Complexity: The calculation proceeds in tw...
-
[26]
Projection: We compute the projected target vector $g = (W^{(l)}_O)^\top w_{U,y}$. Since $W^{(l)}_O$ is a $d \times d$ matrix, this matrix-vector multiplication costs $O(d^2)$.
-
[27]
Assignment: We distribute the contribution to $H$ heads by performing dot products between the head outputs $o_h$ and the corresponding segments of $g$. For $H$ heads, this sums to $O(d)$.
• Stage Complexity: The projection step ($O(d^2)$) dominates the assignment step ($O(d)$). Integrating over $L$ layers and $T$ tokens, the total complexity is $O(L \cdot T \cdot d^2)$.
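The projection and assignment steps can be sketched as below. The sketch assumes the attention output is $W_O$ applied to the concatenated head outputs as a column vector, and it stops at per-head logit contributions; how those are converted into shares of the attention probability change is not shown in the excerpt above, so it is left out. The function name and toy matrices are hypothetical.

```python
def head_logit_contributions(head_outputs, W_O, w_target):
    """Per-head contribution to the target logit.

    head_outputs: list of H vectors whose concatenation has length d.
    W_O: d x d output projection, stored as a list of rows.
    w_target: unembedding vector of the target token, length d.
    """
    d = len(w_target)
    # Projection: g = W_O^T w_target, a matrix-vector product costing O(d^2).
    g = [sum(W_O[i][j] * w_target[i] for i in range(d)) for j in range(d)]
    # Assignment: dot each head output with its segment of g, O(d) in total.
    contributions = []
    offset = 0
    for o_h in head_outputs:
        segment = g[offset:offset + len(o_h)]
        contributions.append(sum(a * b for a, b in zip(o_h, segment)))
        offset += len(o_h)
    return contributions

# Toy example with d = 4 and H = 2. All numbers are invented.
W_O = [[1, 0, 2, 0], [0, 1, 0, 0], [0, 0, 1, 1], [1, 0, 0, 1]]
w_target = [1, 2, 0, -1]
heads = [[1, 2], [3, 4]]
contribs = head_logit_contributions(heads, W_O, w_target)

# Cross-check against the direct computation w_target . (W_O o).
o = [x for head in heads for x in head]
direct = sum(w_target[i] * sum(W_O[i][j] * o[j] for j in range(4)) for i in range(4))
print(contribs, direct)
```

Projecting the target vector once and reusing `g` for every head is what keeps the assignment step linear in `d`.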
-
[28]
Mapping Attention to Input Sources. Finally, we map head contributions to the four sources by aggregating attention weights $A \in \mathbb{R}^{H \times |s| \times |s|}$. This involves two distinct sub-steps for each generated token at step $t$ within a single layer:
• Step 1: Summation. For each head $h$, we sum the attention weights corresponding to specific source indices (e.g., $I_{\text{RAG}}$): w...
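Step 1 can be sketched as a grouped sum over one head's attention row. Only the summation is shown, because the excerpt is cut off before the second sub-step; the function name, the index layout, and the toy weights are assumptions made for illustration.

```python
def source_attention_mass(attn_row, source_indices):
    """Sum one head's attention weights over each source's token positions.

    attn_row: attention weights of a single head at generation step t,
              one weight per visible position.
    source_indices: dict mapping source name -> list of positions (hypothetical layout).
    """
    return {name: sum(attn_row[i] for i in idx) for name, idx in source_indices.items()}

# Toy layout: position 0 is the query, 1-2 the retrieved context,
# 3 an earlier generated token, 4 the current (self) token.
attn_row = [0.125, 0.25, 0.25, 0.125, 0.25]
layout = {
    "Query": [0],
    "RAG Context": [1, 2],
    "Past Token": [3],
    "Self Token": [4],
}
mass = source_attention_mass(attn_row, layout)
print(mass)
```

Because the four index sets partition the visible positions, the four masses add up to the row's total attention weight.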