arxiv: 2604.16430 · v1 · submitted 2026-04-06 · 💻 cs.CL · cs.AI

HalluSAE: Detecting Hallucinations in Large Language Models via Sparse Auto-Encoders

Boshui Chen, Zhaoxin Fan, Ke Wang, Zhiying Leng, Faguo Wu, Hongwei Zheng, Yifan Sun, Wenjun Wu

Pith reviewed 2026-05-10 19:35 UTC · model grok-4.3

keywords: hallucination detection, sparse autoencoders, latent dynamics, phase transitions, potential energy, feature attribution, causal probing, large language models

The pith

Large language models hallucinate when their internal generation trajectory crosses identifiable energy thresholds that sparse autoencoders can track.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper frames hallucination as a dynamical process rather than a static output flaw, modeling each generation step as a point moving through a potential energy landscape derived from the model's activations. Sparse autoencoders decompose those activations so that sharp rises in a geometric energy metric mark the exact zones where factual errors begin. Once localized, contrastive logit methods tie the errors to particular high-energy features, and linear probes on the disentangled representations turn the detection into a causal check rather than a correlational one.

Core claim

Hallucinations correspond to phase-transition-like shifts in latent dynamics; these shifts are located by applying a geometric potential energy metric to sparse autoencoder features along the generation path, after which contrastive attribution isolates the responsible high-energy sparse features and probing confirms their causal role in factual errors.

What carries the argument

The geometric potential energy metric computed on sparse autoencoder activations, which identifies critical transition zones along the token-generation trajectory.
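A minimal sketch of how such a metric could be computed, using the form quoted from the paper in the Lean theorems section of this page, E(l,t) = ‖SAE(r_t^l) − μ_truth‖². Two things are assumptions made for illustration: SAE(·) is read as the SAE feature-activation vector, and μ_truth is read as the mean of those vectors over factual samples. The helper names (`centroid`, `gpe`, `steepest_rise`) and the toy numbers are hypothetical, and the windowed "largest gain" rule is only one simple way to mark a transition zone, not necessarily the paper's.

```python
from typing import List, Sequence

Vector = Sequence[float]


def centroid(vectors: List[Vector]) -> List[float]:
    """Mean vector; used here as a stand-in for mu_truth (assumption)."""
    if not vectors:
        raise ValueError("need at least one vector")
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def gpe(sae_features: Vector, mu_truth: Vector) -> float:
    """E(l, t) = ||SAE(r_t^l) - mu_truth||_2^2 for one layer and token."""
    if len(sae_features) != len(mu_truth):
        raise ValueError("feature vector and centroid must have equal length")
    return sum((f - m) ** 2 for f, m in zip(sae_features, mu_truth))


def steepest_rise(energies: List[float], width: int) -> List[int]:
    """Layer indices of the window of `width` steps with the largest energy gain."""
    if width < 1 or width >= len(energies):
        raise ValueError("width must be between 1 and len(energies) - 1")
    gains = [energies[s + width] - energies[s] for s in range(len(energies) - width)]
    start = max(range(len(gains)), key=gains.__getitem__)
    return list(range(start, start + width + 1))


# Toy data: 6 layers, 2 SAE features per layer (hypothetical values).
# A single centroid is reused for every layer to keep the example short;
# in practice one centroid per layer would be the natural choice.
mu_truth = centroid([[0.0, 0.1], [0.0, -0.1]])
hallucinated = [[0.1, 0.0], [0.0, 0.1], [0.1, 0.1], [1.0, 1.0], [2.0, 2.0], [2.0, 2.1]]

energies = [gpe(features, mu_truth) for features in hallucinated]
zone = steepest_rise(energies, width=2)
print("energies per layer:", energies)
print("candidate transition layers:", zone)
```

On this toy trajectory the energy stays near zero for the first three layers and then climbs, so the window with the largest gain sits where the climb begins.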

If this is right

Factual mistakes become traceable to specific sparse features rather than diffuse model behavior.
Detection moves from post-hoc output checking to real-time monitoring of latent trajectories.
Linear probes trained on the disentangled features yield causal rather than merely correlational signals.
The three-stage pipeline (zone localization, feature attribution, probing) can be applied at inference time without retraining the base model.
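The attribution stage of that pipeline can be illustrated with a generic contrastive form of direct logit attribution. The page does not give the paper's exact C-DLA definition, so the following is a sketch under assumptions: each SAE feature's score is its activation times the projection of its decoder direction onto the difference between the unembedding vectors of the hallucinated and factual tokens, and the final layer norm is ignored. All names and numbers are hypothetical.

```python
from typing import List, Sequence


def dot(x: Sequence[float], y: Sequence[float]) -> float:
    return sum(a * b for a, b in zip(x, y))


def contrastive_attribution(
    activations: Sequence[float],
    decoder_rows: Sequence[Sequence[float]],
    unembed_hallucinated: Sequence[float],
    unembed_factual: Sequence[float],
) -> List[float]:
    """Score_i = a_i * <d_i, u_hallucinated - u_factual> (assumed form)."""
    contrast = [h - f for h, f in zip(unembed_hallucinated, unembed_factual)]
    return [a * dot(d, contrast) for a, d in zip(activations, decoder_rows)]


def top_features(scores: Sequence[float], k: int) -> List[int]:
    """Indices of the k features pushing hardest toward the hallucinated token."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]


# Toy example: 3 SAE features, residual stream of width 2 (hypothetical values).
activations = [0.5, 2.0, 1.0]
decoder_rows = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
u_hallucinated = [0.0, 1.0]
u_factual = [1.0, 0.0]

scores = contrastive_attribution(activations, decoder_rows, u_hallucinated, u_factual)
print("scores:", scores)
print("top feature:", top_features(scores, k=1))
```

A positive score means the feature raises the hallucinated token's logit relative to the factual one; a negative score means it favors the factual token.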

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same energy-landscape view might extend to non-text generation tasks where errors also accumulate along a sequence.
If transition zones prove stable across prompts, they could serve as natural insertion points for corrective interventions during generation.
Training data that reduces the frequency or height of these energy spikes might lower hallucination rates without explicit alignment.

Load-bearing premise

Hallucinations reliably produce measurable high-energy spikes and phase-transition shifts that the sparse autoencoder decomposition can isolate from normal generation dynamics.

What would settle it

A controlled run on prompts known to trigger hallucinations in which the potential energy metric shows no distinct peaks at the error tokens while detection accuracy remains no better than random baselines.
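One way to make "no better than random" concrete is to score each sample by its peak energy and compute AUROC against the hallucination labels, where a value near 0.5 indicates chance-level separation. The choice of AUROC and of peak energy as the score are assumptions for this sketch, and the numbers are made up.

```python
from typing import Sequence


def auroc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Probability that a random positive outscores a random negative (ties count half)."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative sample")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical peak energies; label 1 = hallucinated, 0 = factual.
labels = [1, 1, 0, 0]
separating = auroc([8.4, 7.9, 0.3, 0.2], labels)      # energy peaks at the errors
uninformative = auroc([1.0, 1.0, 1.0, 1.0], labels)   # no distinct peaks anywhere

print("AUROC with distinct peaks:", separating)
print("AUROC with flat energy:", uninformative)
```

If a controlled run produced scores like the second case on prompts known to trigger hallucinations, that would count against the load-bearing premise.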

Figures

Figures reproduced from arXiv: 2604.16430 by Boshui Chen, Faguo Wu, Hongwei Zheng, Ke Wang, Wenjun Wu, Yifan Sun, Zhaoxin Fan, Zhiying Leng.

**Figure 1.** Illustration of Phase Transition in LLM’s reasoning trajectories. The trajectories in potential energy space reveal three phases: early stability (Phase I), critical transition (Phase II, yellow highlight), and sustained error plateau (Phase III). Factual generation (blue) maintains low energy throughout, while hallucination (gradient color) undergoes abrupt energy increase (∆E) during the transition zone…

**Figure 2.** Illustration of the Exploratory Experimental Design. We conduct two complementary experiments: Exp 1 investigates layer-wise energy distribution by dividing 42 layers into Early/Middle/Late groups and comparing GPE differences; Exp 2 identifies microscopic feature-level contributions by analyzing differential features that exhibit high activation in hallucination samples but low activation in factual sam…

**Figure 3.** Experimental Results on Hallucination Dynamics. (a) Layer-wise energy distribution (grouped analysis). Grouped box plots reveal that hallucination samples’ GPE exhibits significant escalation from Early to Late layer groups (***p < 0.001). Error bars indicate 95% confidence intervals (bootstrap, n=200 per group). (b) Sparse feature contribution analysis. Cumulative energy contribution curve demonstrates t…

**Figure 5.** Illustration of Layer-wise Energy Difference Evolution Across All 42 Layers. Geometric Potential Energy (GPE) difference (Hallucination - Factual) reveals three distinct phases: stable period (L0–22, near-zero difference with random fluctuations), transition zone (L23–35, sharp 20.7-fold escalation highlighted by shaded region), and plateau period (L36–41, sustained high-energy state). Error bars repres…

**Figure 4.** Examples of Samples with High-Activation on Identified Sparse Features. Each feature exhibits stable and interpretable error patterns: L27-9659 shows fixed year substitution (always outputs “1993”), L28-87984 demonstrates numerical shift patterns (+4 years, -0.2 billion), L23-71479 drives domain-specific entity confusion (Tesla/Edison, Newton/Einstein), and L24-35793 triggers cross-domain celebrity subst…

**Figure 5.1.** Pareto Distribution of Feature Contributions

**Figure 6.** Figure 6 further characterizes the distribution of feature importance as measured by C-DLA scores. The cumulative attribution curve exhibits a pronounced elbow: the top 0.1% of features (131 out of 131,072) account for 41.1% of total attribution strength, while the top 1% (1,310 features) cover 62.5%. The Gini coefficient of 0.912, markedly higher than the random baseline of 0.414, confirms that hallucination is go…
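The concentration statistics in the Figure 6 text (top-fraction share and Gini coefficient) can be reproduced in principle from a list of per-feature attribution magnitudes. The sketch below uses the standard rank-based Gini formula for non-negative values and assumes the top-k count is obtained by rounding down, which matches the 131 and 1,310 quoted above. Function names and toy values are hypothetical, and the paper's own scores are not reproduced here.

```python
from typing import Sequence


def gini(values: Sequence[float]) -> float:
    """Gini coefficient of non-negative values (0 = perfectly even)."""
    xs = sorted(values)
    n, total = len(xs), sum(xs)
    if n == 0 or total <= 0 or xs[0] < 0:
        raise ValueError("need non-negative values with a positive sum")
    weighted = sum(rank * x for rank, x in enumerate(xs, start=1))
    return 2.0 * weighted / (n * total) - (n + 1) / n


def top_k_count(n_features: int, fraction: float) -> int:
    """Number of features in the top `fraction`, rounded down (at least 1)."""
    return max(1, int(n_features * fraction))


def top_share(values: Sequence[float], fraction: float) -> float:
    """Share of the total held by the top `fraction` of values."""
    k = top_k_count(len(values), fraction)
    return sum(sorted(values, reverse=True)[:k]) / sum(values)


# Feature counts quoted in the caption: 0.1% and 1% of 131,072 SAE features.
print(top_k_count(131072, 0.001), top_k_count(131072, 0.01))

# Toy attribution magnitudes (hypothetical).
even = [1.0, 1.0, 1.0, 1.0]
concentrated = [0.0, 0.0, 0.0, 1.0]
print("Gini even:", gini(even), "Gini concentrated:", gini(concentrated))
print("top-50% share:", top_share([4.0, 3.0, 2.0, 1.0], 0.5))
```

For n values the Gini coefficient tops out at (n − 1)/n, so the fully concentrated four-value example gives 0.75 rather than 1.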

**Figure 7.** Data Cleaning Pipeline. Five-stage workflow from raw data to curated dataset. Dataset Statistics and Quality Control

**Figure 8.** GPT-4o Annotation Prompt. Complete template with explicit grounding, numerical tolerance, structured reasoning, and JSON output. Key Implementation Details. Uncertainty methods: LN-Entropy computes entropy on final logits with lognormalization; Semantic Entropy clusters 10 nucleus-sampled outputs using Sentence-BERT embeddings and agglomerative clustering. Consistency methods: Lexical Similarity computes …
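For the Semantic Entropy baseline mentioned in the Figure 8 text, the final step can be sketched as the entropy of the cluster distribution over the sampled outputs. This assumes the clustering (Sentence-BERT embeddings plus agglomerative clustering) has already produced one cluster id per sample, and it uses plain cluster frequencies as probabilities, which is a simplification; the cluster ids below are invented.

```python
import math
from collections import Counter
from typing import Hashable, Sequence


def semantic_entropy(cluster_ids: Sequence[Hashable]) -> float:
    """Shannon entropy (nats) of the empirical cluster distribution."""
    if not cluster_ids:
        raise ValueError("need at least one sampled output")
    n = len(cluster_ids)
    return -sum((c / n) * math.log(c / n) for c in Counter(cluster_ids).values())


# 10 sampled answers per prompt, as in the caption (cluster ids are hypothetical).
consistent = ["A"] * 10              # every sample lands in one meaning cluster
split = ["A"] * 5 + ["B"] * 5        # samples disagree evenly between two meanings

print("consistent:", semantic_entropy(consistent))
print("split:", semantic_entropy(split))
```

Higher entropy means the sampled answers disagree more in meaning, which this family of baselines treats as a sign of a likely hallucination.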

Original abstract

Large Language Models (LLMs) are powerful and widely adopted, but their practical impact is limited by the well-known hallucination phenomenon. While recent hallucination detection methods have made notable progress, we find most of them overlook the dynamic nature and underlying mechanisms of it. To address this gap, we propose HalluSAE, a phase transition-inspired framework that models hallucination as a critical shift in the model's latent dynamics. By modeling the generation process as a trajectory through a potential energy landscape, HalluSAE identifies critical transition zones and attributes factual errors to specific high-energy sparse features. Our approach consists of three stages: (1) Potential Energy Empowered Phase Zone Localization via sparse autoencoders and a geometric potential energy metric; (2) Hallucination-related Sparse Feature Attribution using contrastive logit attribution; and (3) Probing-based Causal Hallucination Detection through linear probes on disentangled features. Extensive experiments on Gemma-2-9B demonstrate that HalluSAE achieves state-of-the-art hallucination detection performance.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

HalluSAE frames hallucinations as phase transitions in an SAE-derived potential energy landscape and claims SOTA detection on Gemma-2-9B, but the abstract supplies no numbers or baselines to evaluate that claim. The new element is the three-stage pipeline: potential-energy phase zone localization with sparse autoencoders and a geometric metric, contrastive logit attribution to tie errors to high-energy sparse features, and linear probes on disentangled features for final detection. This setup tries to treat generation as a trajectory and link hallucinations to internal dynamics rather than post-hoc output checks, which is a reasonable direction given how many detectors stay at the surface level. The paper does a clean job outlining a coherent pipeline that directly addresses the dynamic aspect most prior work overlooks.

The soft spots sit mainly in the missing evidence. The abstract asserts state-of-the-art results and extensive experiments without any accuracy figures, dataset descriptions, baseline comparisons, or ablation results, so the central performance claim cannot be assessed from the text provided. The potential energy landscape is introduced as a modeling choice without visible independent grounding or derivation, which leaves open whether it is a natural fit or an ad-hoc construct that happens to separate the cases. If the full paper contains solid, reproducible numbers and checks on the phase-transition assumption, those gaps close; otherwise the method risks being another fitted detector.

This work is aimed at the LLM interpretability and safety crowd, especially readers already using SAEs who want a mechanistic handle on generation failures. Someone in that group would get usable ideas from the framing and pipeline even if they treat the results as provisional. It deserves peer review because the approach is distinct enough and the problem is important; referees can check the experiments and see whether the evidence backs the story.

Referee Report

2 major / 2 minor

Summary. The paper proposes HalluSAE, a phase transition-inspired framework for detecting hallucinations in LLMs. It models generation trajectories through a potential energy landscape using sparse autoencoders to localize critical transition zones, attributes factual errors to high-energy sparse features via contrastive logit attribution, and applies linear probes on disentangled features for causal detection. Experiments on Gemma-2-9B are claimed to achieve state-of-the-art hallucination detection performance.

Significance. If the results hold, the work could advance mechanistic interpretability by framing hallucinations as identifiable shifts in latent dynamics rather than isolated output errors. The integration of SAEs with a geometric energy metric for feature attribution offers a structured pipeline that may enable more targeted debugging of LLM internals. No mention of open code, reproducible artifacts, or machine-checked proofs is present, but the three-stage design is coherent.

major comments (2)

[Abstract] The claim of state-of-the-art performance on Gemma-2-9B is asserted without any reported metrics, baselines, dataset statistics, or ablation results, rendering the central empirical contribution impossible to evaluate from the provided text.
[§3.1] The geometric potential energy metric used for phase zone localization is introduced without a formal derivation, grounding in LLM dynamics, or comparison to alternative formulations; this choice is load-bearing for the phase-transition interpretation and the subsequent attribution to high-energy features.

minor comments (2)

[§3] A figure or pseudocode block clarifying how SAE activations are mapped to the potential energy landscape would improve readability of the first stage.
[Notation] Notation for sparse features, energy values, and contrastive logits should be consolidated in a symbol table to avoid ambiguity across sections.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments on the abstract and the potential energy metric. We address each major point below and commit to revisions that strengthen the manuscript without altering its core claims.

Referee: [Abstract] The claim of state-of-the-art performance on Gemma-2-9B is asserted without any reported metrics, baselines, dataset statistics, or ablation results, rendering the central empirical contribution impossible to evaluate from the provided text.

Authors: We agree that the abstract should be more self-contained to allow immediate evaluation of the empirical claims. The full paper reports detailed results in Section 4, including accuracy, F1, and AUROC metrics on standard hallucination benchmarks (e.g., TruthfulQA and HaluEval subsets) with dataset sizes and baseline comparisons (e.g., against logit-based and representation-based detectors). In the revised version we will expand the abstract to include the key quantitative results supporting the SOTA claim, along with brief dataset and ablation summaries. revision: yes
Referee: [§3.1] The geometric potential energy metric used for phase zone localization is introduced without a formal derivation, grounding in LLM dynamics, or comparison to alternative formulations; this choice is load-bearing for the phase-transition interpretation and the subsequent attribution to high-energy features.

Authors: We acknowledge that a more explicit derivation would improve clarity and rigor. The metric is constructed from the SAE reconstruction error combined with a sparsity penalty, motivated by viewing next-token generation as motion in a latent energy landscape where high reconstruction error signals instability. In the revision we will add a formal derivation in §3.1 that connects the metric to the model's cross-entropy loss surface, provide the explicit formula with all terms defined, and include a short comparison to alternative formulations such as token-level entropy or gradient-norm energy proxies, explaining why the chosen geometric form best aligns with the phase-transition framing. revision: yes

Circularity Check

0 steps flagged

No significant circularity in derivation chain

The paper describes an empirical framework (HalluSAE) with three stages: SAE-based potential-energy phase zone localization, contrastive feature attribution, and linear-probe detection. No equations, first-principles derivations, or predictions are presented that reduce by construction to fitted inputs, self-citations, or ansatzes. The modeling choices (potential energy landscape, phase transitions) are introduced as interpretive tools rather than derived results, and performance claims rest on experimental validation on Gemma-2-9B rather than any self-referential reduction. The argument structure is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

Based solely on the abstract, the framework rests on the modeling assumption that LLM generation can be usefully represented as a trajectory in a potential energy landscape whose high-energy zones correspond to hallucinations; sparse autoencoders are treated as a standard tool whose features can be causally linked to factual errors via probes. No numerical free parameters are named.

axioms (1)

domain assumption: Hallucinations manifest as critical phase transitions in the model's latent dynamics.
Core premise stated in the abstract that justifies the potential-energy localization stage.

invented entities (1)

potential energy landscape for LLM generation trajectories (no independent evidence)
purpose: To localize critical transition zones where hallucinations occur.
Introduced as the central modeling device; no external validation or falsifiable prediction outside the detection task is mentioned.

pith-pipeline@v0.9.0 · 5506 in / 1335 out tokens · 58325 ms · 2026-05-10T19:35:21.881915+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

Paper passage: "modeling the generation process as a trajectory through a potential energy landscape... geometric potential energy metric $E(l,t)=\lVert \mathrm{SAE}(r_t^{l})-\mu_{\mathrm{truth}}\rVert_2^2$"

IndisputableMonolith/Foundation/DimensionForcing.lean · alexander_duality_circle_linking · unclear

Paper passage: "phase transition zones... exponential energy growth... high-energy sparse features"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

13 extracted references · 13 canonical work pages

[1] Entity Verification: Check if core entities (names, places, dates) match

[2] Numerical Precision: Allow minor formatting differences (e.g., "20%" vs "20 percent"), but mark significant deviations as incorrect

[3] Contradiction Check: If response contradicts reference knowledge, mark INCORRECT

[4] Relevance: If response is irrelevant or incomplete, mark INCORRECT. Step-by-Step Reasoning:

[5] Identify the key claim in [Ground Truth]

[6] Extract the corresponding claim from [Model Response]

[7] Compare and explicitly state discrepancies

[8] Determine final verdict. Output Format (JSON): { "reasoning": "Concise explanation highlighting specific errors if any.", "label": "CORRECT" or "INCORRECT" } Figure 8. GPT-4o Annotation Prompt. Complete template with explicit grounding, numerical tolerance, structured reasoning, and JSON output. Key Implementation Details. Uncertainty methods: LN-Entropy com...

[9] Split the 1,260 training samples into 1,008 train and 252 validation samples

[10] Extract 100-dimensional feature vectors from the transition zone using the pre-trained SAE

[11] Standardize features using the training set statistics

[12] Train Logistic Regression with candidate C values

[13] Evaluate on the validation set and record AUC. After identifying the optimal C, we retrain the final detector on the full training set (1,260 samples) and evaluate on the held-out test set (360 samples). Table 16 summarizes the complete configuration. Table 16. Complete Detector Configuration Component Configuration Model Type Logistic Regression (L1 regula...
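The probe-training steps listed in entries [9] to [13] (split, standardize with training statistics, sweep candidate C values, pick the best by validation AUC) can be sketched end to end. To stay dependency-free this sketch swaps in a small gradient-descent logistic regression with an L2 penalty, whereas the source's truncated Table 16 points to L1 regularization; it also uses synthetic 4-dimensional data rather than the 100-dimensional SAE features. Every name, hyperparameter, and candidate C value here is an assumption.

```python
import math
import random


def standardize_fit(rows):
    """Per-feature mean and std from the training split only."""
    cols = list(zip(*rows))
    means = [sum(c) / len(c) for c in cols]
    stds = [math.sqrt(sum((v - m) ** 2 for v in c) / len(c)) or 1.0
            for c, m in zip(cols, means)]
    return means, stds


def standardize_apply(rows, means, stds):
    return [[(v - m) / s for v, m, s in zip(r, means, stds)] for r in rows]


def sigmoid(z):
    z = max(-30.0, min(30.0, z))  # clamp to avoid overflow in exp
    return 1.0 / (1.0 + math.exp(-z))


def train_logreg(X, y, C, epochs=200, lr=0.1):
    """Batch gradient descent on mean log loss plus ||w||^2 / (2 * C * n)."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        gw, gb = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            gb += err
            for j in range(d):
                gw[j] += err * xi[j]
        w = [wj - lr * (gj / n + wj / (C * n)) for wj, gj in zip(w, gw)]
        b -= lr * gb / n
    return w, b


def predict(X, w, b):
    return [sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) for xi in X]


def auroc(scores, labels):
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0 for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


def select_c(X_tr, y_tr, X_val, y_val, candidates):
    """Return the C with the highest validation AUC and all recorded AUCs."""
    means, stds = standardize_fit(X_tr)
    Xs_tr = standardize_apply(X_tr, means, stds)
    Xs_val = standardize_apply(X_val, means, stds)
    results = {}
    for C in candidates:
        w, b = train_logreg(Xs_tr, y_tr, C)
        results[C] = auroc(predict(Xs_val, w, b), y_val)
    return max(results, key=results.get), results


# Synthetic stand-in data; the 80/20 split mirrors the 1,008 / 252 ratio.
rng = random.Random(0)
y_all = [i % 2 for i in range(50)]
X_all = [[rng.gauss(2.0 * yi, 1.0) for _ in range(4)] for yi in y_all]
cut = int(0.8 * len(X_all))
candidates = [0.01, 0.1, 1.0, 10.0]

best_c, val_auc = select_c(X_all[:cut], y_all[:cut], X_all[cut:], y_all[cut:], candidates)
print("validation AUC per C:", val_auc)
print("selected C:", best_c)
```

Following entry [13], the selected C would then be used to refit on the full training set before a single evaluation on the held-out test set; that last step is omitted here for brevity.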