
arxiv: 2512.07515 · v4 · submitted 2025-12-08 · 💻 cs.CL · cs.AI

TPA: Next Token Probability Attribution for Detecting Hallucinations in RAG

Pith reviewed 2026-05-17 00:59 UTC · model grok-4.3

classification 💻 cs.CL cs.AI
keywords hallucination detection · retrieval-augmented generation · token probability attribution · part-of-speech analysis · large language models · RAG · model interpretability

The pith

Attributing each next-token probability to seven model sources and grouping by part-of-speech tags detects hallucinations in RAG outputs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper addresses the incomplete picture from prior work that viewed hallucinations mainly as conflicts between internal knowledge in feed-forward networks and retrieved context. It proposes TPA to decompose the probability of every generated token into contributions from the query, RAG context, past tokens, self token, feed-forward network, final layer normalization, and initial embedding. These decomposed scores are then summed within each part-of-speech category to expose unusual patterns. One such pattern is nouns drawing heavily from layer normalization instead of the provided context, which marks a hallucinated response. Experiments show this yields stronger detection than earlier methods.

Core claim

TPA mathematically attributes each token's probability to seven distinct sources: Query, RAG Context, Past Token, Self Token, FFN, Final LayerNorm, and Initial Embedding. Aggregating the attribution scores by Part-of-Speech tags reveals component-specific anomalies, such as nouns relying excessively on Final LayerNorm, that reliably indicate hallucinated responses in RAG.

What carries the argument

The seven-source decomposition of next-token probability, aggregated by POS tags to surface anomalies like noun over-reliance on LayerNorm.
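
The aggregation step can be illustrated with a minimal sketch. It assumes the per-token attribution scores have already been computed and simply sums them within each POS tag. The function name `aggregate_by_pos`, the snake_case source identifiers, and the toy scores are hypothetical and are not taken from the paper.

```python
from collections import defaultdict

# Source names follow the paper's list of seven; the identifiers are illustrative.
SOURCES = (
    "query", "rag_context", "past_token", "self_token",
    "ffn", "final_layernorm", "initial_embedding",
)


def aggregate_by_pos(pos_tags, attributions):
    """Sum per-token attribution vectors within each POS tag.

    pos_tags: one tag per generated token, e.g. "NOUN".
    attributions: one length-7 row per token, ordered as SOURCES.
    Returns a flat feature dict such as {"NOUN_final_layernorm": 0.75, ...}.
    """
    if len(pos_tags) != len(attributions):
        raise ValueError("need exactly one attribution row per token")
    totals = defaultdict(lambda: [0.0] * len(SOURCES))
    for tag, row in zip(pos_tags, attributions):
        if len(row) != len(SOURCES):
            raise ValueError("each row must hold one score per source")
        for i, score in enumerate(row):
            totals[tag][i] += score
    return {
        f"{tag}_{name}": value
        for tag, sums in totals.items()
        for name, value in zip(SOURCES, sums)
    }


# Toy scores, invented for illustration only.
tags = ["NOUN", "VERB", "NOUN"]
scores = [
    [0.0, 0.25, 0.0, 0.0, 0.0, 0.5, 0.0],
    [0.0, 0.5, 0.0, 0.0, 0.25, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0, 0.0, 0.25, 0.0],
]
features = aggregate_by_pos(tags, scores)
```

In this toy input the noun tokens draw more from the LayerNorm column than from the RAG-context column, which is the kind of pattern the paper reports as a hallucination signal. A downstream classifier would consume the flat feature dict.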

If this is right

  • Hallucination detection works by tracking how much each linguistic category draws from different model components rather than from context.
  • Nouns that depend mainly on LayerNorm adjustments rather than RAG context reliably flag hallucinations.
  • The method supplies interpretable signals about which internal component drives factual errors during generation.
  • Detection improves by incorporating query and token-history effects instead of limiting analysis to a binary knowledge-context conflict.
  • State-of-the-art results support deployment for increasing factual reliability in RAG systems.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same attribution technique may help locate other generation problems such as inconsistencies that do not involve retrieval.
  • POS patterns could guide targeted adjustments to specific components like LayerNorm during model development.
  • Accounting for interactions among the seven sources might produce still sharper detection in follow-on work.

Load-bearing premise

The seven sources fully and additively account for each token probability without large interaction effects or model artifacts that would invalidate the POS anomaly signals.

What would settle it

A collection of RAG responses in which nouns show high LayerNorm attribution yet the content remains factually correct and grounded in the retrieved context, or hallucinated responses that lack any such POS anomalies.

Figures

Figures reproduced from arXiv: 2512.07515 by Anjin Liu, Guangquan Zhang, Jie Lu, Pengqian Lu.

Figure 1: Applying the TPA framework to a Llama2-7b …
Figure 2: Overview of the TPA framework. (1) Coarse-Grained Decomposition: Complete decomposition of token probability into four components (Section 3.2). (2) Fine-Grained Attribution: Mapping attention contributions to four input sources via head-specific weights (Section 3.3). (3) Syntax-Aware Feature Engineering: Aggregating these attributions by POS tags to construct the final detection features (Section 3.3.4). …
Figure 3: SHAP summary plots illustrating the decision logic. We visualize the top-10 features for classifiers trained …
Figure 4: F1 Score Drop by Removing Components.
    … results for Llama3-8B and Mistral-7B are detailed in Appendix due to space constraints. We obtain three observations from this analysis. Observation 1: Fine-grained attribution is necessary. Relying solely on the binary conflict between internal FFN knowledge and external RAG context is insufficient for robust detection. As shown in Figure 3, the classifier frequently …
Figure 5: SHAP summary plots illustrating the decision logic. We visualize the top-10 features for classifiers trained …
Original abstract

Detecting hallucinations in Retrieval-Augmented Generation remains a challenge. Prior approaches attribute hallucinations to a binary conflict between internal knowledge stored in FFNs and the retrieved context. However, this perspective is incomplete, failing to account for the impact of other components of the LLM, such as the user query, previously generated tokens, the self token, and the final LayerNorm adjustment. To comprehensively capture the impact of these components on hallucination detection, we propose TPA which mathematically attributes each token's probability to seven distinct sources: Query, RAG Context, Past Token, Self Token, FFN, Final LayerNorm, and Initial Embedding. This attribution quantifies how each source contributes to the generation of the next token. Specifically, we aggregate these attribution scores by Part-of-Speech (POS) tags to quantify the contribution of each model component to the generation of specific linguistic categories within a response. By leveraging these patterns, such as detecting anomalies where Nouns rely heavily on LayerNorm, TPA effectively identifies hallucinated responses. Extensive experiments show that TPA achieves state-of-the-art performance.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript proposes TPA, which mathematically attributes each next-token probability to seven sources (Query, RAG Context, Past Token, Self Token, FFN, Final LayerNorm, Initial Embedding) and aggregates the scores by POS tags to surface anomalies (e.g., nouns relying heavily on LayerNorm) that indicate hallucinations in RAG outputs, reporting state-of-the-art detection performance.

Significance. If the seven-source decomposition is shown to be additive and the POS anomalies are shown to be reliable signals rather than artifacts, the work would meaningfully extend hallucination detection beyond binary FFN-versus-context views by offering component-level and linguistically grouped interpretability.

major comments (2)
  1. [§3] §3 (Attribution derivation): the central claim requires that next-token probability decomposes completely and additively into the seven listed sources so that POS aggregation can reliably detect anomalies. Transformer non-linearities (attention softmax, FFN activations, LayerNorm scaling) produce interaction effects; the manuscript does not verify that the sum of the seven attributions reconstructs the original probability or quantify any residual. Without this check the POS-based anomaly detection rests on an unverified assumption.
  2. [§4] §4 (Experimental results): the SOTA performance is presented without reported error bars, ablation on POS grouping rules or anomaly thresholds, or statistical tests comparing against baselines. This makes it difficult to assess whether the reported gains are robust or sensitive to post-hoc choices in the anomaly definition.
minor comments (1)
  1. [Abstract] Abstract: the seven sources are listed but 'Self Token' is not defined; a one-sentence gloss would improve immediate clarity.
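
The check requested in major comment 1 could look something like the sketch below. It assumes that, for each generated token, the seven attributions are meant to sum to the model's final next-token probability; that convention, the function name `reconstruction_report`, and the toy numbers are assumptions made for illustration, not details confirmed by the manuscript.

```python
import math


def reconstruction_report(attributions, probabilities):
    """Compare the sum of each token's attributions with its actual probability.

    attributions: one row of seven scores per generated token.
    probabilities: the model's next-token probability for the same tokens.
    Returns the mean and maximum absolute residual.
    """
    if not probabilities or len(attributions) != len(probabilities):
        raise ValueError("need one non-empty attribution row per probability")
    residuals = [
        abs(math.fsum(row) - p) for row, p in zip(attributions, probabilities)
    ]
    return {
        "mean_abs_error": math.fsum(residuals) / len(residuals),
        "max_abs_error": max(residuals),
    }


# Toy values, invented for illustration: the first token reconstructs exactly,
# the second leaves a residual of 0.25.
rows = [
    [0.5, 0.25, 0.0, 0.0, 0.125, 0.0, 0.0],
    [0.25, 0.25, 0.0, 0.0, 0.0, 0.0, 0.0],
]
report = reconstruction_report(rows, [0.875, 0.75])
```

A residual near floating-point precision would support the additivity assumption; a residual comparable in size to the attributions themselves would undercut the POS-level signals.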

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful and constructive feedback. We address each major comment below and will revise the manuscript accordingly to strengthen the claims and experimental rigor.

  1. Referee: [§3] §3 (Attribution derivation): the central claim requires that next-token probability decomposes completely and additively into the seven listed sources so that POS aggregation can reliably detect anomalies. Transformer non-linearities (attention softmax, FFN activations, LayerNorm scaling) produce interaction effects; the manuscript does not verify that the sum of the seven attributions reconstructs the original probability or quantify any residual. Without this check the POS-based anomaly detection rests on an unverified assumption.

    Authors: We agree that an explicit verification of additivity is necessary to support the reliability of POS-based anomaly detection. Our derivation isolates component contributions via targeted logit decomposition, but we acknowledge that non-linearities can produce small interaction residuals not captured in the original submission. In the revision we will add a verification analysis (new subsection in §3 or appendix) that sums the seven attributions and reports the mean absolute reconstruction error relative to the original next-token probability across the evaluation sets, along with any adjustments to the anomaly scoring if residuals prove non-negligible. revision: yes

  2. Referee: [§4] §4 (Experimental results): the SOTA performance is presented without reported error bars, ablation on POS grouping rules or anomaly thresholds, or statistical tests comparing against baselines. This makes it difficult to assess whether the reported gains are robust or sensitive to post-hoc choices in the anomaly definition.

    Authors: We concur that the absence of error bars, ablations, and statistical tests limits assessment of robustness. The original results used a single fixed threshold and standard POS tagging without reporting variability. In the revised §4 we will report mean and standard deviation over five independent runs, include ablations on anomaly thresholds (e.g., 10–30 % LayerNorm reliance) and POS grouping variants (e.g., coarse vs. fine tags), and add statistical comparisons (bootstrap confidence intervals and McNemar’s test) against baselines to demonstrate that performance gains are stable rather than sensitive to post-hoc choices. revision: yes
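
The statistical comparisons named in the second response can be sketched with the standard library alone. This is an illustrative outline, not the authors' evaluation code: the function names, the percentile bootstrap, and the exact (binomial) form of McNemar's test are choices made here, and labels are assumed to be binary with 1 marking a hallucinated response.

```python
import math
import random


def f1_score(y_true, y_pred):
    """F1 for the positive class (1); returns 0.0 when undefined."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def bootstrap_f1_ci(y_true, y_pred, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for F1."""
    rng = random.Random(seed)
    n = len(y_true)
    scores = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]  # resample with replacement
        scores.append(f1_score([y_true[i] for i in idx], [y_pred[i] for i in idx]))
    scores.sort()
    lo = scores[int((alpha / 2) * n_boot)]
    hi = scores[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return lo, hi


def mcnemar_exact(y_true, pred_a, pred_b):
    """Two-sided exact McNemar p-value computed from the discordant pairs."""
    b = sum(1 for t, a, c in zip(y_true, pred_a, pred_b) if a == t and c != t)
    c = sum(1 for t, a, c in zip(y_true, pred_a, pred_b) if a != t and c == t)
    n = b + c
    if n == 0:
        return 1.0  # the two detectors never disagree on correctness
    tail = sum(math.comb(n, i) for i in range(min(b, c) + 1)) / 2 ** n
    return min(1.0, 2 * tail)
```

McNemar's test only looks at examples where exactly one of the two detectors is correct, which makes it a natural fit for comparing two detectors on the same test set. The bootstrap interval conveys how much the reported F1 would move under resampling of that set.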

Circularity Check

0 steps flagged

No significant circularity detected in derivation chain


The paper derives TPA via a proposed mathematical attribution of next-token probabilities to seven explicitly enumerated model components (Query, RAG Context, Past Token, Self Token, FFN, Final LayerNorm, Initial Embedding), followed by POS-tag aggregation to surface anomalies. This decomposition is presented as an internal mechanistic breakdown rather than a fit to hallucination labels or a self-referential definition. No equation reduces the target anomaly detection to its inputs by construction, no load-bearing self-citations appear, and the SOTA claims rest on external benchmark experiments. The method is therefore self-contained against independent evaluation data.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The method rests on the assumption that token probability can be cleanly decomposed into the listed seven additive sources without residual interactions. No free parameters are explicitly named in the abstract, though anomaly thresholds for POS patterns are likely tuned. No new physical entities are introduced.

axioms (1)
  • domain assumption Next-token probability can be attributed additively to Query, RAG Context, Past Token, Self Token, FFN, Final LayerNorm, and Initial Embedding
    Stated in the abstract as the basis for TPA
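
As a hedged sketch, the axiom can be written out with the ΔP symbols that appear in the paper's complexity analysis. The residual term ε_t and the per-source symbols ΔP^(l)_{att,s} are notation introduced here for illustration, and the baseline convention for ΔP_initial is an assumption rather than something the abstract states.

```latex
\[
P(y_t) \;=\; \Delta P_{\text{initial}}
  + \sum_{l=1}^{L} \Bigl( \Delta P^{(l)}_{\text{att}} + \Delta P^{(l)}_{\text{ffn}} \Bigr)
  + \Delta P_{\text{LN}} + \varepsilon_t ,
\qquad
\Delta P^{(l)}_{\text{att}} \;=\;
  \sum_{s \in \{\text{Query},\, \text{RAG},\, \text{Past},\, \text{Self}\}}
  \Delta P^{(l)}_{\text{att},s}
\]
```

Under this reading the axiom amounts to ε_t = 0 for every generated token, with the four attention sources plus FFN, Final LayerNorm, and Initial Embedding making up the seven attributed quantities.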




Reference graph

Works this paper leans on

28 extracted references · 28 canonical work pages · 1 internal anchor

  1. [1]

    RefChecker: Reference-based Fine-grained Hallucination Checker and Benchmark for Large Language Models, May 2024

    Looking for a needle in a haystack: A comprehensive study of hallucinations in neural machine translation. In Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics, pages 1059–1075. Jiatong Han, Jannik Kossen, Muhammed Razzak, Lisa Schut, Shreshth A Malik, and Yarin Gal. 2024. Semantic entropy pro...

  2. [2]

    Ragtruth: A hallucination corpus for developing trustworthy retrieval-augmented language models. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 10862–10878. nostalgebraist. 2020. interpreting gpt: the logit lens. LessWrong. Seongheon Park, Xuefeng Du, Min-Hsuan Yeh, Haobo Wang, an...

  3. [3]

    Redeep: Detecting hallucination in retrieval-augmented generation via mechanistic interpretability. In ICLR. Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, and 1 others. 2023. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint...

  4. [4]

    EigenScore/INSIDE (Chen et al., 2024) Focuses on detecting hallucination by evaluating the response’s semantic consistency, which is defined as the log-determinant of the covariance matrix of the LLM’s internal states while generating the response

  5. [5]

    SEP (Han et al., 2024) Proposes a linear model to detect hallucination based on semantic entropy at test time without requiring multiple responses

  6. [6]

    SAPLMA (Azaria and Mitchell, 2023) Detects hallucination based on the hidden-layer activations of LLMs

  7. [7]

    ITI (Li et al., 2023) Detects hallucination based on the hidden-layer activations of LLMs

  8. [8]

    Ragtruth Prompt (Niu et al., 2024) Provides prompts for an LLM-as-judge to detect hallucination in the RAG setting

  9. [9]

    ChainPoll (Friel and Sanyal, 2023) Provides prompts for an LLM-as-judge to detect hallucination in the RAG setting

  10. [10]

    RAGAS (Es et al., 2024) It uses an LLM to split the response into a set of statements and to verify that each statement is supported by the retrieved documents. If any statement is not supported, the response is considered hallucinated

  11. [11]

    Trulens (TrueLens, 2024) Uses an LLM to evaluate the overlap between the retrieved documents and the generated response in order to detect hallucination

  12. [12]

    P(True) (Kadavath et al., 2022) The paper detects hallucinations by having the model estimate the probability that its own generated answer is correct, based on the key assumption that it is often easier for a model to recognize a correct answer than to generate one

  13. [13]

    SelfCheckGPT (Manakul et al., 2023) SelfCheckGPT detects hallucinations by checking for informational consistency across multiple stochastically sampled responses, based on the assumption that factual knowledge leads to consistent statements while hallucinations lead to divergent and contradictory ones

  14. [14]

    LN-Entropy (Malinin and Gales, 2021) This paper detects hallucinations by quantifying knowledge uncertainty, which it measures primarily with a novel metric called Reverse Mutual Information (RMI) that captures the disagreement across an ensemble’s predictions, with high RMI indicating a likely hallucination

  15. [15]

    Energy (Liu et al., 2020) This paper detects hallucinations by using an energy score, derived directly from the model’s logits, as a more reliable uncertainty measure than softmax confidence to identify out-of-distribution inputs that cause the model to hallucinate

  16. [16]

    Focus (Zhang et al., 2023) This paper detects hallucinations by calculating an uncertainty score focused on keywords, and then refines it by propagating penalties from unreliable context via attention and correcting token probabilities using entity types and inverse document frequency to mitigate both overconfidence and underconfidence

  17. [17]

    Perplexity (Ren et al., 2023) This paper detects hallucinations by separately measuring the Relative Mahalanobis Distance for both input and output embeddings, based on the assumption that in-domain examples will have embeddings closer to their respective foreground (in-domain) distributions than to a generic background distribution

  18. [18]

    REFCHECKER (Hu et al., 2024) It uses an LLM to extract claim-triplets from a response and a second LLM to verify them in order to detect hallucination

  19. [19]

    REDEEP (Sun et al., 2025) It detects hallucination by analyzing the balance between the contributions from Copying Heads that process external context and Knowledge FFNs that inject internal knowledge, based on the finding that RAG hallucinations often arise from conflicts between these two sources. This method has two versions: token level and chunk le...

  20. [20]

    NoVo (Ho et al., 2025) It leverages the L2 norms of specific attention heads as reliable indicators of truthfulness. By identifying a subset of truth-correlated heads from a small reference set, it employs a voting mechanism based on these head norms to detect hallucinations without requiring model parameter updates

  21. [21]

    TSV (Park et al., 2025) It introduces a lightweight steering vector to reshape the LLM’s latent space during inference. By actively intervening to enhance the linear separability between truthful and hallucinated representations in the hidden states, it enables effective detection using a simple classifier on the steered embeddings. 13 Complexity Ana...

  22. [22]

    Complete Probability Decomposition. To satisfy Theorem 1, we must compute the complete probability changes using the probe function $\Phi(h, y)$. The bottleneck is the calculation of the global partition function (denominator) in Softmax.
    • Mechanism: The probe function $\Phi(h, y) = \mathrm{Softmax}(h W_U)_y = \frac{\exp(w_{U,y}^{\top} h)}{\sum_{v \in V} \exp(w_{U,v}^{\top} h)}$ requires projecting the hidden s...

  23. [23]

    Global Components: For $\Delta P_{\text{initial}}$ and $\Delta P_{\text{LN}}$, the probe is called once per generation step. Cost: $O(T \cdot |V| \cdot d)$

  24. [24]

    Layer Components: For $\Delta P^{(l)}_{\text{att}}$ and $\Delta P^{(l)}_{\text{ffn}}$, the probe is invoked twice per layer (before and after the residual update). Summing over $L$ layers, this costs $O(L \cdot T \cdot |V| \cdot d)$.
    • Stage Complexity: Combining these terms, the dominant complexity is $O(L \cdot T \cdot |V| \cdot d)$

  25. [25]

    Head-wise Attribution. Once $\Delta P^{(l)}_{\text{att}}$ is obtained, we apportion it to individual heads based on their contribution to the target logit.
    • Mechanism: This attribution requires projecting the target token vector $w_{U,y}$ back into the hidden state space using the layer’s output projection matrix $W^{(l)}_O \in \mathbb{R}^{d \times d}$.
    • Step Complexity: The calculation proceeds in tw...

  26. [26]

    Projection: We compute the projected target vector $g = (W^{(l)}_O)^{\top} w_{U,y}$. Since $W^{(l)}_O$ is a $d \times d$ matrix, this matrix-vector multiplication costs $O(d^2)$

  27. [27]

    Assignment: We distribute the contribution to $H$ heads by performing dot products between the head outputs $o_h$ and the corresponding segments of $g$. For $H$ heads, this sums to $O(d)$.
    • Stage Complexity: The projection step ($O(d^2)$) dominates the assignment step ($O(d)$). Integrating over $L$ layers and $T$ tokens, the total complexity is $O(L \cdot T \cdot d^2)$

  28. [28]

    Mapping Attention to Input Sources. Finally, we map head contributions to the four sources by aggregating attention weights $A \in \mathbb{R}^{H \times |s| \times |s|}$. This involves two distinct sub-steps for each generated token at step $t$ within a single layer:
    • Step 1: Summation. For each head $h$, we sum the attention weights corresponding to specific source indices (e.g., $I_{\text{RAG}}$): w...
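
The probe-based decomposition and the head-wise projection described in entries 22 to 27 can be sketched in plain Python on a toy model. This is an illustration under stated assumptions, not the paper's implementation: `W_U` is taken to be a d × |V| nested list, every update is modelled as a plain residual addition (so the final LayerNorm step is left out), and the function names and toy numbers are hypothetical. The sketch stops at per-head logit contributions because the extract does not give the rule for turning them into shares of ΔP_att.

```python
import math


def probe(h, W_U, y):
    """Phi(h, y) = Softmax(h W_U)_y; one call costs O(|V| * d)."""
    d, vocab = len(h), len(W_U[0])
    logits = [sum(h[i] * W_U[i][v] for i in range(d)) for v in range(vocab)]
    top = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - top) for z in logits]
    return exps[y] / sum(exps)


def telescoping_deltas(h0, updates, W_U, y):
    """Apply residual updates in order and record the probe change after each."""
    h = list(h0)
    p_start = probe(h, W_U, y)
    p_prev, deltas = p_start, []
    for name, update in updates:
        h = [a + b for a, b in zip(h, update)]
        p_new = probe(h, W_U, y)
        deltas.append((name, p_new - p_prev))
        p_prev = p_new
    return p_start, deltas, p_prev


def head_logit_contributions(head_outputs, W_O, w_target):
    """Project g = W_O^T w_target (O(d^2)), then dot each head with its segment (O(d))."""
    d = len(w_target)
    g = [sum(W_O[i][j] * w_target[i] for i in range(d)) for j in range(d)]
    contributions, offset = [], 0
    for o_h in head_outputs:
        segment = g[offset:offset + len(o_h)]
        contributions.append(sum(a * b for a, b in zip(o_h, segment)))
        offset += len(o_h)
    return contributions


# Toy model, invented for illustration: d = 2, |V| = 3, one layer.
W_U = [[1.0, 0.0, -1.0], [0.0, 1.0, 0.5]]
updates = [("att_1", [0.5, 0.0]), ("ffn_1", [0.0, -0.5])]
p_start, deltas, p_final = telescoping_deltas([0.0, 0.0], updates, W_U, y=0)
heads = head_logit_contributions([[1.0], [2.0]], [[1.0, 2.0], [3.0, 4.0]], [1.0, 1.0])
```

Because each delta is a difference of consecutive probe values, the deltas telescope: the starting probability plus all deltas equals the final probability up to floating-point error, whatever non-linearities sit inside the probe. Exactness of this sum says nothing about whether the order-dependent split among components is meaningful, which is the substance of the referee's concern.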