Recognition: no theorem link
Hallucination Detection via Activations of Open-Weight Proxy Analyzers
Pith reviewed 2026-05-11 02:39 UTC · model grok-4.3
The pith
Small open-weight models detect hallucinations in any LLM output by reading their own internal activations.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that hallucinations in text generated by any large language model can be detected by analyzing the internal activations of an independent open-weight proxy reader. Features drawn from the reader's residual stream, per-head attention to source documents, entropy, MLP activations, logit-lens trajectories, and three new token grounding statistics are fed into a stacking ensemble. This yields consistent AUC gains of 7.4 to 10.3 percentage points over ReDeEP on RAGTruth across seven different analyzers, with tight clustering of results and no requirement that the proxy match the generator's family or size.
What carries the argument
Eighteen features extracted from the proxy reader's transformer activations, including residual stream norms, source-document attention, entropy, MLP activations, logit-lens trajectories, and three token-level grounding statistics, combined via a stacking ensemble classifier.
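To make the pipeline concrete, here is a minimal sketch of how eighteen per-token features could feed a stacking ensemble. The base learners, meta-learner, and file names are illustrative assumptions, not the paper's released configuration.

```python
# Minimal sketch (not the paper's code): a stacking ensemble over per-token
# activation features extracted from a proxy analyzer. The choice of base
# learners and meta-learner is an assumption.
import numpy as np
from sklearn.ensemble import StackingClassifier, RandomForestClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# X: one row per token, 18 columns (residual norms, source attention, entropy,
# MLP activations, logit-lens stats, grounding statistics, ...).
# y: 1 if the token falls inside a hallucinated span, else 0.
X, y = np.load("features.npy"), np.load("labels.npy")  # placeholder paths

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, stratify=y, random_state=0)

stack = StackingClassifier(
    estimators=[
        ("gbt", GradientBoostingClassifier()),
        ("rf", RandomForestClassifier(n_estimators=200)),
    ],
    final_estimator=LogisticRegression(max_iter=1000),  # meta-learner
    cv=5,
)
stack.fit(X_tr, y_tr)
print("token-level AUC:", roc_auc_score(y_te, stack.predict_proba(X_te)[:, 1]))
```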
If this is right
- Hallucination checks become possible for closed API models without any internal access to the generator.
- Detection runs locally on small hardware since the analyzer need not match the generator's size.
- Training on datasets containing outputs from multiple model families allows the detector to generalize across different generators.
- Within-family results show that larger analyzers do not always outperform smaller ones.
Where Pith is reading between the lines
- Real-time output filtering in applications could use a lightweight local proxy without sending data back to the generator provider.
- The same activation-reading approach might be tested on other generation problems such as factual inconsistency or repetition.
- The observed size-independence suggests the hallucination signal lives in low-level processing patterns rather than requiring high model capacity.
Load-bearing premise
Activations inside the proxy analyzer reliably signal whether the input text contains hallucinations even when the text was produced by a different model family.
What would settle it
A clear drop in AUC below the ReDeEP baseline when the same ensemble is tested on hallucinated outputs from a generator architecture and domain completely absent from the original training data.
Original abstract
We introduce a proxy-analyzer framework for detecting hallucinations in large language models. Instead of looking inside the generating model, our system reads already-generated text through a small locally hosted open-weight model and spots hallucinations using the reader's own internal activations. This works just as well when the generator is a closed API like GPT-4 as when it is any open-weight model. We built eighteen features grounded in how transformers process text, covering residual stream norms, per-head source-document attention, entropy, MLP activations, logit-lens trajectories, and three new token-level grounding statistics. We trained a stacking ensemble on 72,135 samples from five hallucination datasets. We tested across seven analyzer architectures from 0.5 billion to 9 billion parameters: Qwen2.5 at 0.5B and 7B, Gemma-2 at 2B and 9B, Pythia at 1.4B, and LLaMA-3 at both 3B and 8B. Across all seven, we consistently beat ReDeEP's token-level AUC of 0.73 on RAGTruth by 7.4 to 10.3 percentage points. Qwen2.5-7B reached an F1 of 0.717, just above ReDeEP's 0.713, while Qwen2.5-0.5B hit 0.706. The most striking finding is how tightly all seven models cluster: AUC spans only 2.3 percentage points across an eighteen-fold difference in model size. Even more surprising, our 3B LLaMA outperforms our 8B LLaMA on RAGTruth, showing that bigger is not always better even within the same model family. Both RAGTruth and LLM-AggreFact include outputs from multiple LLM families, so our results are not skewed toward any particular generator.
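As a rough illustration of what "reading already-generated text through a small open-weight model" involves, the sketch below pulls three of the named feature families (residual-stream norms, next-token entropy, per-head attention on the source document) from a small proxy. The model choice, example text, and span boundary are assumptions, not the paper's setup.

```python
# Minimal sketch, assuming a small Hugging Face causal LM as the proxy analyzer.
# Extracts three of the feature families the abstract names; all inputs are toy.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "Qwen/Qwen2.5-0.5B"  # any small open-weight proxy would do
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name, attn_implementation="eager")
model.eval()

source = "The Eiffel Tower is in Paris."        # retrieved source document
answer = " The tower was completed in 1889."    # generated text under review
enc = tok(source + answer, return_tensors="pt")
src_len = len(tok(source)["input_ids"])         # approximate source-span boundary

with torch.no_grad():
    out = model(**enc, output_hidden_states=True, output_attentions=True)

# (a) residual-stream norm per layer and token: [n_layers + 1, seq]
resid_norms = torch.stack([h[0].norm(dim=-1) for h in out.hidden_states])

# (b) next-token entropy from the final logits: [seq]
probs = out.logits[0].softmax(-1)
entropy = -(probs * probs.clamp_min(1e-12).log()).sum(-1)

# (c) per-head attention mass on the source span, last layer: [n_heads, seq]
src_attention = out.attentions[-1][0, :, :, :src_len].sum(-1)
```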
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces a proxy-analyzer framework for hallucination detection that extracts 18 features (residual stream norms, per-head source-document attention, entropy, MLP activations, logit-lens trajectories, and three new token-level grounding statistics) from the internal activations of small open-weight models applied to already-generated text. A stacking ensemble is trained on 72,135 samples from five hallucination datasets and evaluated across seven proxy analyzers (Qwen2.5 0.5B/7B, Gemma-2 2B/9B, Pythia 1.4B, LLaMA-3 3B/8B). The central empirical claim is that all seven proxies consistently outperform ReDeEP's token-level AUC of 0.73 on RAGTruth by 7.4–10.3 points, with F1 scores of 0.706–0.717, and that performance clusters tightly across an 18-fold size range; the method is asserted to work equally for closed-API generators such as GPT-4.
Significance. If the performance claims and generalization hold after the requested clarifications, the work supplies a practical, generator-agnostic detection method that requires only a locally hosted open-weight reader and no access to the original model's states or training data. The observation that 0.5B–3B proxies perform on par with 7B–9B models (and that the 3B LLaMA-3 outperforms the 8B variant on RAGTruth) is noteworthy for efficiency. The use of multiple datasets containing outputs from several families provides some support for the claim of robustness beyond any single generator.
major comments (3)
- [§3 and §4] §3 (Feature Extraction) and §4 (Training): the 18 features are described at a high level but lack explicit mathematical definitions, extraction code, or hyperparameter values (learning rate, ensemble meta-learner architecture, regularization). Without these, the reported AUC/F1 gains cannot be independently reproduced or verified as arising from the claimed activation-based signals rather than implementation-specific choices.
- [§5 and Tables 1–2] §5 (Results) and Table 1/2: no error bars, standard deviations across runs, or statistical significance tests accompany the AUC improvements of 7.4–10.3 points or the F1 values. The central claim that the method “consistently beats” ReDeEP therefore rests on point estimates whose reliability cannot be assessed from the reported data.
- [§5.3 and §2] §5.3 (Generalization) and §2 (Datasets): although RAGTruth and LLM-AggreFact contain outputs from multiple families, the manuscript provides no explicit cross-family hold-out experiment (e.g., training only on LLaMA-family generations and testing on GPT-4 or Qwen generations). This leaves open the possibility that the stacking ensemble exploits generator-specific stylistic or attention artifacts rather than pure hallucination signals, directly affecting the claim that the approach works for unseen closed-API generators.
minor comments (2)
- [§5] The ReDeEP baseline is referenced but its exact token-level AUC computation and feature set are not restated; a short comparison table would improve clarity.
- [Figure 3] Figure 3 (performance vs. model size) would benefit from error bars or multiple random seeds to visualize the tightness of the 2.3-point AUC cluster.
Simulated Author's Rebuttal
We thank the referee for their thorough review and constructive feedback. We believe the suggested revisions will strengthen the manuscript and address the concerns regarding reproducibility, statistical reporting, and generalization. Below we provide point-by-point responses to the major comments.
Point-by-point responses
Referee: [§3 and §4] §3 (Feature Extraction) and §4 (Training): the 18 features are described at a high level but lack explicit mathematical definitions, extraction code, or hyperparameter values (learning rate, ensemble meta-learner architecture, regularization). Without these, the reported AUC/F1 gains cannot be independently reproduced or verified as arising from the claimed activation-based signals rather than implementation-specific choices.
Authors: We agree that detailed specifications are essential for reproducibility. In the revised version, we will provide explicit mathematical definitions for all 18 features, including formulas for residual stream norms, per-head attention, entropy, MLP activations, logit-lens trajectories, and the three new token-level grounding statistics. We will also report the exact hyperparameter settings for the stacking ensemble, such as the learning rate, the architecture of the meta-learner, and regularization parameters. Furthermore, we commit to releasing the feature extraction and training code upon publication to allow independent verification. revision: yes
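For a sense of what such definitions might look like, here is one plausible rendering of a logit-lens trajectory feature. It is a sketch under the assumption of a LLaMA/Qwen-style final norm and unembedding, not the paper's definition.

```python
# Sketch of one possible "logit-lens trajectory" feature: project each layer's
# residual stream through the unembedding and track the probability assigned to
# the token that actually appears next. The submodule path "model.norm" assumes
# a LLaMA/Qwen-style architecture; the paper's exact definition may differ.
import torch

def logit_lens_trajectory(model, hidden_states, input_ids):
    """hidden_states: tuple of [1, seq, d] tensors; returns [n_layers + 1, seq - 1]."""
    unembed = model.get_output_embeddings()        # lm_head
    norm = model.get_submodule("model.norm")       # final RMSNorm (LLaMA/Qwen-style)
    targets = input_ids[0, 1:]                     # token actually produced next
    traj = []
    for h in hidden_states:
        logits = unembed(norm(h[0]))               # [seq, vocab]
        probs = logits.softmax(-1)[:-1]            # align positions with next tokens
        traj.append(probs[torch.arange(targets.numel()), targets])
    return torch.stack(traj)                       # probability per layer per token
```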
Referee: [§5 and Tables 1–2] §5 (Results) and Table 1/2: no error bars, standard deviations across runs, or statistical significance tests accompany the AUC improvements of 7.4–10.3 points or the F1 values. The central claim that the method “consistently beats” ReDeEP therefore rests on point estimates whose reliability cannot be assessed from the reported data.
Authors: We recognize the value of statistical rigor in reporting results. We will revise the results section and tables to include error bars representing standard deviations from multiple independent runs (e.g., with different random seeds for training). Additionally, we will conduct and report statistical significance tests, such as paired t-tests or Wilcoxon tests, comparing our method's performance against ReDeEP to substantiate the improvements of 7.4–10.3 AUC points. revision: yes
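A sketch of the kind of test being promised, assuming per-example labels and scores from both detectors are available; the function and variable names are placeholders, not the paper's data.

```python
# Paired bootstrap over examples for the AUC difference between two detectors,
# plus a Wilcoxon signed-rank test on per-document scores (commented usage).
import numpy as np
from scipy.stats import wilcoxon
from sklearn.metrics import roc_auc_score

def paired_bootstrap_auc(y, scores_a, scores_b, n_boot=10_000, seed=0):
    rng = np.random.default_rng(seed)
    n = len(y)
    deltas = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, n)               # resample examples with replacement
        if np.unique(y[idx]).size < 2:            # AUC needs both classes present
            continue
        deltas.append(roc_auc_score(y[idx], scores_a[idx]) -
                      roc_auc_score(y[idx], scores_b[idx]))
    lo, hi = np.percentile(deltas, [2.5, 97.5])
    return lo, hi                                 # 95% CI for the AUC gain

# per-document scores under the two detectors -> Wilcoxon signed-rank test
# stat, p = wilcoxon(per_doc_ours, per_doc_redeep)
```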
Referee: [§5.3 and §2] §5.3 (Generalization) and §2 (Datasets): although RAGTruth and LLM-AggreFact contain outputs from multiple families, the manuscript provides no explicit cross-family hold-out experiment (e.g., training only on LLaMA-family generations and testing on GPT-4 or Qwen generations). This leaves open the possibility that the stacking ensemble exploits generator-specific stylistic or attention artifacts rather than pure hallucination signals, directly affecting the claim that the approach works for unseen closed-API generators.
Authors: This is a valid concern for establishing true generalization. To address it, we will perform additional cross-family hold-out experiments in the revised manuscript. Specifically, we will train the ensemble on subsets of the data from specific generator families (e.g., LLaMA) and evaluate on held-out generations from other families (e.g., GPT-4 and Qwen) present in RAGTruth and LLM-AggreFact. The results of these experiments will be added to §5.3 to provide direct evidence supporting the generator-agnostic claim. revision: yes
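The promised protocol could look like the leave-one-family-out loop below; the data layout and the use of a single logistic model in place of the full stacking ensemble are simplifying assumptions made only to keep the sketch short.

```python
# Sketch of a cross-family hold-out: train on all generator families except one,
# then evaluate on the held-out family. Column names and the file path are
# assumptions about how released data might be organized.
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

df = pd.read_parquet("token_features.parquet")        # placeholder path
feature_cols = [c for c in df.columns if c not in ("family", "label")]

for held_out in sorted(df["family"].unique()):        # e.g. gpt-4, llama, qwen
    train, test = df[df["family"] != held_out], df[df["family"] == held_out]
    clf = LogisticRegression(max_iter=1000).fit(train[feature_cols], train["label"])
    auc = roc_auc_score(test["label"], clf.predict_proba(test[feature_cols])[:, 1])
    print(f"held-out generator family {held_out}: AUC = {auc:.3f}")
```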
Circularity Check
No circularity: empirical feature extraction and supervised ensemble
full rationale
The paper describes an empirical pipeline that extracts 18 activation-based features from open-weight proxy models, trains a stacking ensemble on 72,135 labeled hallucination samples, and reports AUC/F1 on held-out test sets (RAGTruth, LLM-AggreFact). No equations, derivations, or self-citations appear in the provided text; performance numbers are obtained by direct evaluation rather than by re-expressing fitted parameters as predictions. The method is self-contained against external benchmarks and does not reduce any claim to a definitional identity or load-bearing self-citation chain.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
- [1] Azaria, A. and Mitchell, T. (2023). The internal state of an LLM knows when it's lying. EMNLP Findings.
- [2] Chen, C. et al. (2024). INSIDE: LLMs' internal states retain the power of hallucination detection. ICLR 2024.
- [3]
- [4] Elhage, N. et al. (2021). A mathematical framework for transformer circuits. Transformer Circuits Thread.
- [5]
- [6] Farquhar, S. et al. (2024). Detecting hallucinations in large language models using semantic entropy. Nature 630.
- [7] Geva, M. et al. (2021). Transformer feed-forward layers are key-value memories. EMNLP 2021.
- [8]
- [9] Ji, Z. et al. (2023). Survey of hallucination in natural language generation. ACM Computing Surveys 56(12).
- [10] Li, J. et al. (2023). HaluEval: A large-scale hallucination evaluation benchmark. EMNLP 2023.
- [11] Lytang, C. et al. (2023). LLM-AggreFact: A unified benchmark for hallucination detection. arXiv.
- [12] Manakul, P. et al. (2023). SelfCheckGPT: Zero-resource black-box hallucination detection. EMNLP 2023.
- [13] Marks, S. and Tegmark, M. (2023). The geometry of truth: Emergent linear structure in large language model representations of true/false datasets. arXiv:2310.06824.
- [14] Meng, K. et al. (2022). Locating and editing factual associations in GPT. NeurIPS 2022.
- [15] Nie, Y. et al. (2020). Adversarial NLI: A new benchmark for natural language understanding. ACL 2020.
- [16] nostalgebraist (2020). Interpreting GPT: The logit lens. AI Alignment Forum.
- [17]
- [18] Sun, Z. et al. (2025). ReDeEP: Detecting hallucination in RAG systems via mechanistic interpretability. ICLR 2025.
- [19] Tang, L. et al. (2024). MiniCheck: Efficient fact-checking of LLMs on grounding documents. EMNLP 2024.
- [20] Vectara (2023). HHEM-2.1-Open: Hughes Hallucination Evaluation Model. HuggingFace: vectara/hallucination_evaluation_model.
- [21] Wang, K. et al. (2022). Interpretability in the wild: A circuit for indirect object identification in GPT-2. arXiv:2211.00593.
- [22] Wu, C. et al. (2024). RAGTruth: A hallucination corpus for developing trustworthy RAG language models. ACL 2024.