Where Fake Citations Are Made: Tracing Field-Level Hallucination to Specific Neurons in LLMs
Pith reviewed 2026-05-10 04:18 UTC · model grok-4.3
The pith
LLMs contain sparse field-specific neurons that causally drive citation hallucinations, which can be amplified to increase errors or suppressed to reduce them.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
A sparse set of field-specific hallucination neurons (FH-neurons) is identified in Qwen2.5-32B-Instruct via elastic-net regularization and stability selection applied to neuron CETT values. These neurons are shown to be causally linked to citation errors because amplifying their activations increases hallucination rates while suppressing them improves accuracy across citation fields.
What carries the argument
Field-specific hallucination neurons (FH-neurons), a sparse subset of neurons selected by elastic-net regularization with stability selection on CETT values, which causally modulate the rate of fabricated citations when their activations are scaled up or down.
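The selection step named above can be illustrated with a minimal sketch. It is not the paper's implementation: it assumes a squared-error elastic net fitted by coordinate descent (the paper may well use a classification loss), a toy feature matrix standing in for per-neuron CETT values, and hypothetical settings for `lam`, `alpha`, `n_rounds`, and `threshold`. Stability selection is shown in its simplest form: refit on random half-samples and keep the features chosen in most of them.

```python
import random

def soft_threshold(value, threshold):
    """Shrink value toward zero by threshold (the L1 proximal step)."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0

def elastic_net(X, y, lam, alpha, n_iter=50):
    """Coordinate descent for (1/2n)||y - Xw||^2 + lam*(alpha*|w|_1 + (1-alpha)/2*|w|_2^2)."""
    n, p = len(X), len(X[0])
    cols = [list(col) for col in zip(*X)]
    z = [sum(v * v for v in col) / n for col in cols]
    w = [0.0] * p
    residual = list(y)  # equals y - Xw because w starts at zero
    for _ in range(n_iter):
        for j in range(p):
            # Correlation of feature j with the residual that leaves feature j out.
            rho = sum(c * (r + c * w[j]) for c, r in zip(cols[j], residual)) / n
            new_w = soft_threshold(rho, lam * alpha) / (z[j] + lam * (1 - alpha))
            delta = new_w - w[j]
            if delta != 0.0:
                residual = [r - c * delta for c, r in zip(cols[j], residual)]
            w[j] = new_w
    return w

def stability_selection(X, y, lam, alpha, n_rounds=20, threshold=0.8, seed=0):
    """Refit on random half-samples; keep features with a nonzero weight in >= threshold of fits."""
    rng = random.Random(seed)
    n, p = len(X), len(X[0])
    counts = [0] * p
    for _ in range(n_rounds):
        idx = rng.sample(range(n), n // 2)
        w = elastic_net([X[i] for i in idx], [y[i] for i in idx], lam, alpha)
        for j in range(p):
            if w[j] != 0.0:
                counts[j] += 1
    freq = [c / n_rounds for c in counts]
    return [j for j in range(p) if freq[j] >= threshold], freq

# Synthetic stand-in data: only features 0 and 2 carry signal (an assumption for the demo).
data_rng = random.Random(1)
X = [[data_rng.gauss(0, 1) for _ in range(6)] for _ in range(200)]
y = [2.0 * row[0] - 1.5 * row[2] + 0.1 * data_rng.gauss(0, 1) for row in X]
selected, freq = stability_selection(X, y, lam=0.3, alpha=0.5)
print("selected features:", selected)
print("selection frequencies:", freq)
```

The selection frequency is the quantity that makes the result "stable": a feature that enters the model only on some resamples is dropped, which is also why the regularization strength and the frequency threshold are free parameters worth reporting.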
If this is right
- Author names are the most error-prone field in LLM-generated citations across models and styles.
- Hallucination detection signals learned for one citation field transfer poorly to other fields.
- Suppressing the identified neurons improves citation accuracy in multiple fields without requiring retraining or external knowledge.
- Models trained with reasoning-oriented distillation show degraded citation recall compared with base models.
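Field-level claims like those above rest on scoring each citation field separately. The following is a minimal sketch of such scoring, assuming a hypothetical field set (`authors`, `title`, `venue`, `year`) and a simple normalized exact-match rule; the paper's actual matching criteria are not described here, and the two example citations are made-up placeholders.

```python
from collections import Counter

FIELDS = ("authors", "title", "venue", "year")  # hypothetical field set

def normalize(value):
    """Lowercase and collapse whitespace so trivial formatting differences are not errors."""
    return " ".join(str(value).lower().split())

def field_error_rates(pairs, fields=FIELDS):
    """pairs: list of (generated, reference) dicts. Returns {field: fraction mismatched}."""
    errors = Counter()
    total = 0
    for generated, reference in pairs:
        total += 1
        for field in fields:
            if normalize(generated.get(field, "")) != normalize(reference.get(field, "")):
                errors[field] += 1
    if total == 0:
        return {}
    return {field: errors[field] / total for field in fields}

# Placeholder citations, invented purely for the demo.
pairs = [
    ({"authors": "A. Smith, B. Lee", "title": "Sparse Probes", "venue": "ACL", "year": 2021},
     {"authors": "A. Smith, C. Wu", "title": "Sparse  probes", "venue": "ACL", "year": 2021}),
    ({"authors": "D. Kim", "title": "Graph Editing", "venue": "NeurIPS", "year": 2020},
     {"authors": "D. Kim", "title": "Graph Editing", "venue": "ICML", "year": 2020}),
]
rates = field_error_rates(pairs)
print(rates)
```

Exact matching is deliberately strict; a fuzzier rule (for example, tolerating initials versus full first names) would change the absolute rates and is a design choice the evaluation would need to state.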
Where Pith is reading between the lines
- The same neuron-location technique could be tested on other factual error types such as invented facts or incorrect dates to check for modularity.
- Targeted editing of these neurons might allow domain-specific hallucination reduction without affecting overall model capability.
- Whether these neurons overlap with circuits handling general factuality would clarify if citation errors are a distinct internal mechanism.
Load-bearing premise
The neurons chosen by the regularization procedure are the actual cause of the field-specific hallucinations rather than simply correlated with them or with other aspects of text generation.
What would settle it
A replication in which suppressing the selected neurons produces no reduction in citation hallucination rates, or amplifying them produces no increase, would break the claimed causal connection.
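The intervention being tested amounts to rescaling selected activations and observing the downstream effect. Here is a toy sketch of that operation, assuming plain Python lists for a hidden-activation vector and a hypothetical linear `readout` standing in for the rest of the network; the scaling factors, neuron indices, and layer locations used in the paper are not given here, so all values below are illustrative.

```python
def scale_neurons(activations, neuron_ids, factor):
    """Return a copy of activations with the chosen neurons multiplied by factor.

    factor > 1 amplifies, 0 <= factor < 1 suppresses, factor == 0 ablates.
    """
    chosen = set(neuron_ids)
    return [a * factor if i in chosen else a for i, a in enumerate(activations)]

def readout(activations, weights):
    """Toy linear readout standing in for everything downstream of the layer."""
    return sum(a * w for a, w in zip(activations, weights))

hidden = [0.5, 1.0, -0.25, 2.0]   # illustrative activations
weights = [1.0, 0.5, 2.0, 1.0]    # illustrative downstream weights
target = [1, 3]                   # hypothetical "FH-neuron" indices

baseline = readout(hidden, weights)
amplified = readout(scale_neurons(hidden, target, 2.0), weights)
suppressed = readout(scale_neurons(hidden, target, 0.0), weights)
print(baseline, amplified, suppressed)
```

In a replication, the readout would be replaced by the measured hallucination rate under each condition; the causal claim fails if that rate does not move in the predicted direction.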
Original abstract
LLMs frequently generate fictitious yet convincing citations, often expressing high confidence even when the underlying reference is wrong. We study this failure across 9 models and 108,000 generated references, and find that author names fail far more often than other fields across all models and settings. Citation style has no measurable effect, while reasoning-oriented distillation degrades recall. Probes trained on one field transfer at near-chance levels to the others, suggesting that hallucination signals do not generalize across fields. Building on this finding, we apply elastic-net regularization with stability selection to neuron-level CETT values of Qwen2.5-32B-Instruct and identify a sparse set of field-specific hallucination neurons (FH-neurons). Causal intervention further confirms their role: amplifying these neurons increases hallucination, while suppressing them improves performance across fields, with larger gains in some fields. These results suggest a lightweight approach to detecting and mitigating citation hallucination using internal model signals alone.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper studies citation hallucinations in LLMs across 9 models and 108,000 generated references, finding that author names fail at higher rates than other fields regardless of citation style. Reasoning-oriented distillation reduces recall, and field-specific probes transfer at near-chance levels, indicating non-generalizable hallucination signals. In Qwen2.5-32B-Instruct, elastic-net regularization with stability selection on neuron-level CETT values identifies sparse field-specific hallucination neurons (FH-neurons). Causal interventions show that amplifying these neurons increases hallucination rates while suppressing them improves citation performance across fields.
Significance. If the causal specificity of the FH-neurons holds after appropriate controls, the work provides a mechanistic, neuron-level account of field-specific citation hallucination and a lightweight internal-signal approach to detection and mitigation. The scale of the empirical evaluation (9 models, 108k references) and the use of causal interventions rather than purely correlational analysis are strengths that could inform targeted model editing techniques for knowledge-intensive generation tasks.
major comments (3)
- [Causal intervention experiments] The causal intervention results (amplification increases hallucination; suppression improves performance) lack controls comparing effects to randomly sampled neuron sets or to high-CETT neurons excluded by stability selection. Without these baselines, the directional changes cannot be attributed specifically to FH-neurons rather than any sparse perturbation of generation circuitry.
- [Neuron selection and FH-neuron identification] The elastic-net regularization and stability selection procedure for identifying FH-neurons (applied to CETT values) involves free hyperparameters whose sensitivity is not reported; combined with post-selection on the same data used for evaluation, this creates a risk that the sparse set is overfit to observed hallucination patterns rather than causally responsible.
- [Results on interventions and probe transfer] The abstract and results sections provide no error bars, confidence intervals, or exact intervention protocols (e.g., scaling factors, number of neurons intervened, layer locations) for the reported hallucination rate changes; this weakens the ability to assess the magnitude and reliability of the claimed improvements.
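The random-set baseline requested in the first major comment could be set up roughly as follows. This is a sketch under stated assumptions: neurons are identified by integer index, control sets are matched only on size (not on layer or activation magnitude), and the effect values passed to `empirical_p_value` are placeholders rather than results from the paper.

```python
import random

def sample_control_sets(n_neurons, fh_neurons, n_sets=100, seed=0):
    """Draw random neuron sets of the same size as the FH set, disjoint from it."""
    rng = random.Random(seed)
    fh = set(fh_neurons)
    pool = [i for i in range(n_neurons) if i not in fh]
    return [sorted(rng.sample(pool, len(fh))) for _ in range(n_sets)]

def empirical_p_value(fh_effect, control_effects):
    """Share of control sets with an effect at least as large as the FH effect (+1 smoothing)."""
    at_least = sum(1 for effect in control_effects if effect >= fh_effect)
    return (at_least + 1) / (len(control_effects) + 1)

fh_neurons = [3, 17, 256]  # hypothetical indices
controls = sample_control_sets(n_neurons=1000, fh_neurons=fh_neurons, n_sets=5)
print(controls)

# Placeholder effect sizes: change in hallucination rate after the same intervention.
p_value = empirical_p_value(0.12, [0.01, 0.02, -0.01, 0.03])
print(p_value)
```

Running the identical amplification and suppression protocol on each control set, then comparing the FH-neuron effect against that distribution, is what separates a specific effect from a generic consequence of perturbing any few neurons.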
minor comments (2)
- [Methods] Clarify the exact definition and computation of CETT values in the methods section, including any preprocessing or normalization steps.
- [Observational results] The claim that 'citation style has no measurable effect' should be supported by explicit statistical tests or effect sizes rather than a qualitative statement.
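One way to back the citation-style claim with numbers is a chi-square statistic plus an effect size on a style-by-outcome contingency table. The sketch below assumes a table of counts with one row per citation style and one column per outcome (for example, correct versus hallucinated); the counts are invented for illustration. It computes Pearson's chi-square and Cramér's V only; a p-value would additionally need the chi-square distribution, which is available in statistics libraries.

```python
import math

def chi_square_and_cramers_v(table):
    """table[i][j]: count for citation style i and outcome j. Returns (chi2, Cramér's V)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected
    k = min(len(table), len(table[0])) - 1
    return chi2, math.sqrt(chi2 / (n * k))

# Invented counts: identical outcome rates across three styles, so no association.
no_effect = chi_square_and_cramers_v([[80, 20], [80, 20], [80, 20]])
# Invented counts: two styles with clearly different outcome rates.
some_effect = chi_square_and_cramers_v([[90, 10], [70, 30]])
print(no_effect, some_effect)
```

Reporting V alongside the test matters at this scale: with 108,000 references, even a negligible difference can be statistically significant, so "no measurable effect" is best read as a statement about effect size.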
Simulated Author's Rebuttal
We thank the referee for their thoughtful and constructive comments on our manuscript. We address each major comment point by point below, indicating the revisions we will make to strengthen the paper.
- Referee: [Causal intervention experiments] The causal intervention results (amplification increases hallucination; suppression improves performance) lack controls comparing effects to randomly sampled neuron sets or to high-CETT neurons excluded by stability selection. Without these baselines, the directional changes cannot be attributed specifically to FH-neurons rather than any sparse perturbation of generation circuitry.
  Authors: We agree that additional control experiments are required to establish the specificity of the FH-neurons. In the revised manuscript, we will add causal interventions on randomly sampled neuron sets matched for sparsity and on high-CETT neurons excluded by stability selection, using identical amplification and suppression protocols. These baselines will be reported alongside the original results to demonstrate that the observed changes are attributable to the selected FH-neurons rather than generic sparse perturbations.
  revision: yes
- Referee: [Neuron selection and FH-neuron identification] The elastic-net regularization and stability selection procedure for identifying FH-neurons (applied to CETT values) involves free hyperparameters whose sensitivity is not reported; combined with post-selection on the same data used for evaluation, this creates a risk that the sparse set is overfit to observed hallucination patterns rather than causally responsible.
  Authors: We acknowledge the need for greater transparency on hyperparameters and data usage. The revised methods section will fully document all hyperparameters for elastic-net regularization and stability selection, including their specific values and selection rationale. We will also include a sensitivity analysis showing the stability of the identified neuron set across a range of hyperparameter values. To address post-selection concerns, we will explicitly document the data partitioning procedure used for CETT computation and subsequent interventions, clarifying any separation between selection and evaluation data.
  revision: yes
- Referee: [Results on interventions and probe transfer] The abstract and results sections provide no error bars, confidence intervals, or exact intervention protocols (e.g., scaling factors, number of neurons intervened, layer locations) for the reported hallucination rate changes; this weakens the ability to assess the magnitude and reliability of the claimed improvements.
  Authors: We regret these omissions from the main text. In the revision, we will add error bars (standard errors across runs) and 95% confidence intervals to all reported hallucination rates in the results section and figures. We will also expand the methods section with complete intervention protocols, including exact scaling factors, number of neurons per intervention, layer locations, and other parameters. These details will be moved from the appendix into the main body for clarity and reproducibility.
  revision: yes
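As an illustration of the promised uncertainty estimates, a 95% interval for a hallucination rate can be obtained with a percentile bootstrap. This is a minimal sketch, not the authors' procedure: it assumes per-reference binary outcomes (1 = hallucinated) and resamples references rather than runs, and the outcome vector below is a made-up placeholder.

```python
import random

def bootstrap_ci(outcomes, n_boot=2000, level=0.95, seed=0):
    """Percentile bootstrap CI for the mean of 0/1 outcomes. Returns (rate, low, high)."""
    rng = random.Random(seed)
    n = len(outcomes)
    # Resample with replacement and record the mean of each resample.
    means = sorted(sum(rng.choices(outcomes, k=n)) / n for _ in range(n_boot))
    low = means[int((1 - level) / 2 * n_boot)]
    high = means[min(n_boot - 1, int((1 + level) / 2 * n_boot))]
    return sum(outcomes) / n, low, high

# Placeholder data: 30 hallucinated references out of 100.
outcomes = [1] * 30 + [0] * 70
rate, low, high = bootstrap_ci(outcomes)
print(rate, low, high)
```

If variability across runs is the quantity of interest, as the response suggests, the resampling unit would be the run rather than the individual reference.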
Circularity Check
No significant circularity detected
The paper identifies FH-neurons via elastic-net and stability selection on CETT values computed from generated references, then validates via causal interventions (amplification/suppression) that alter hallucination rates. These steps depend on external benchmarks (108k generated references across models) and direct interventions rather than any equation or selection procedure that reduces the outcome to the input data by construction. No self-citations are load-bearing for the core claim, no ansatz is smuggled, and no renaming of known results occurs. The derivation chain remains self-contained against the experimental data.
Axiom & Free-Parameter Ledger
free parameters (1)
- elastic-net regularization strength and stability selection threshold
axioms (2)
- domain assumption: Causal interventions on individual neurons produce interpretable changes in output behavior without compensatory effects from the rest of the network.
- domain assumption: Probe transfer failure across fields indicates genuinely field-specific mechanisms rather than insufficient probe capacity.
D Output Volume Analysis
To examine whether citation order within a prompt affects reliability, we analyze hallucination rate as a function of citation position across all models and generation volumes. As shown in Figure 8, the first two citations exhibit notably lower hallucination rates, after which ...