Where Fake Citations Are Made: Tracing Field-Level Hallucination to Specific Neurons in LLMs

Ruixiang Tang; Xiaodong Lin; Yihao Quan; Yuefei Chen

arxiv: 2604.18880 · v1 · submitted 2026-04-20 · 💻 cs.CL · cs.AI

Pith reviewed 2026-05-10 04:18 UTC · model grok-4.3

keywords LLM hallucination · citation generation · neuron intervention · field-specific errors · causal analysis · elastic net regularization · internal model signals

The pith

LLMs contain sparse field-specific neurons that causally drive citation hallucinations, which can be amplified to increase errors or suppressed to reduce them.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper studies how large language models fabricate citations, tracking errors across nine models and over 100,000 generated references. Author names prove far more unreliable than other fields such as titles or venues, and this pattern holds regardless of citation format. Probes trained to detect hallucinations in one field perform near chance on others, indicating the underlying signals are field-specific rather than general. In one model, regularization on neuron-level activation metrics isolates a small set of neurons tied to these failures. Direct intervention on those neurons raises hallucination rates when their activity is increased and lowers rates when it is decreased, confirming a causal role.

Core claim

A sparse set of field-specific hallucination neurons (FH-neurons) is identified in Qwen2.5-32B-Instruct via elastic-net regularization and stability selection applied to neuron CETT values. These neurons are shown to be causally linked to citation errors because amplifying their activations increases hallucination rates while suppressing them improves accuracy across citation fields.
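A minimal sketch of the selection idea in the core claim, assuming a squared-loss elastic net fit by coordinate descent and a simple subsample-and-count form of stability selection. The loss, the penalty values (`alpha`, `l1_ratio`), the subsample fraction, the 0.8 frequency threshold, and the toy feature matrix standing in for neuron CETT values are all illustrative assumptions; the paper's actual settings are not stated on this page.

```python
import random
from typing import List, Tuple


def soft_threshold(value: float, t: float) -> float:
    """Shrink `value` toward zero by `t` (the L1 proximal step)."""
    if value > t:
        return value - t
    if value < -t:
        return value + t
    return 0.0


def elastic_net(X: List[List[float]], y: List[float],
                alpha: float = 0.2, l1_ratio: float = 0.5,
                n_iter: int = 50) -> List[float]:
    """Coordinate descent for the objective
    (1/2n)||y - Xw||^2 + alpha*l1_ratio*||w||_1 + 0.5*alpha*(1-l1_ratio)*||w||^2.
    """
    n, p = len(X), len(X[0])
    w = [0.0] * p
    resid = list(y)  # residual y - Xw, starting from w = 0
    for _ in range(n_iter):
        for j in range(p):
            col = [row[j] for row in X]
            z = sum(c * c for c in col) / n
            if z == 0.0:
                continue
            # Correlation of feature j with the residual that excludes feature j.
            rho = sum(c * (r + c * w[j]) for c, r in zip(col, resid)) / n
            new_w = soft_threshold(rho, alpha * l1_ratio) / (z + alpha * (1 - l1_ratio))
            delta = new_w - w[j]
            if delta != 0.0:
                resid = [r - c * delta for c, r in zip(col, resid)]
                w[j] = new_w
    return w


def stability_selection(X: List[List[float]], y: List[float],
                        n_rounds: int = 30, frac: float = 0.5,
                        threshold: float = 0.8,
                        seed: int = 0) -> Tuple[List[int], List[float]]:
    """Refit on random subsamples; keep features that are nonzero
    in at least `threshold` of the rounds."""
    rng = random.Random(seed)
    n, p = len(X), len(X[0])
    counts = [0] * p
    for _ in range(n_rounds):
        idx = rng.sample(range(n), max(2, int(frac * n)))
        w = elastic_net([X[i] for i in idx], [y[i] for i in idx])
        for j, wj in enumerate(w):
            if wj != 0.0:
                counts[j] += 1
    freq = [c / n_rounds for c in counts]
    return [j for j, f in enumerate(freq) if f >= threshold], freq


# Toy data (hypothetical): feature 0 tracks the +/-1 label, features 1-4 are noise.
rng = random.Random(1)
y = [1.0 if i % 2 == 0 else -1.0 for i in range(40)]
X = [[label + rng.gauss(0.0, 0.1)] + [rng.uniform(-1.0, 1.0) for _ in range(4)]
     for label in y]
selected, freq = stability_selection(X, y)
```

In this toy setting `selected` holds the indices whose selection frequency clears the threshold. Both the penalty strength and the threshold change which features survive, which is why the hyperparameter sensitivity matters for the claim.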

What carries the argument

Field-specific hallucination neurons (FH-neurons), a sparse subset of neurons selected by elastic-net regularization with stability selection on CETT values, which causally modulate the rate of fabricated citations when their activations are scaled up or down.
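The scaling intervention can be pictured as multiplying a chosen subset of activations by a constant. The sketch below is illustrative only: the function name `scale_neurons`, the flat list of activations, and the factors 2.0 and 0.0 are assumptions, since the page does not report the scaling factors, layers, or neuron counts used in the paper.

```python
from typing import Iterable, List


def scale_neurons(activations: List[float], neuron_ids: Iterable[int],
                  factor: float) -> List[float]:
    """Return a copy of `activations` with the chosen neurons multiplied by `factor`.

    factor > 1 amplifies the neurons, 0 <= factor < 1 suppresses them.
    """
    chosen = set(neuron_ids)
    return [a * factor if i in chosen else a for i, a in enumerate(activations)]


# Hypothetical activations for four neurons; neurons 1 and 2 play the FH-neuron role.
acts = [0.5, -1.0, 2.0, 0.25]
amplified = scale_neurons(acts, [1, 2], 2.0)   # amplification
suppressed = scale_neurons(acts, [1, 2], 0.0)  # full suppression
```

In a real model the same multiplication would be applied inside the forward pass (for example through a hook on the relevant layer), and the hallucination rate would be measured on the resulting generations.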

If this is right

Author names are the most error-prone field in LLM-generated citations across models and styles.
Hallucination detection signals learned for one citation field transfer poorly to other fields.
Suppressing the identified neurons improves citation accuracy in multiple fields without requiring retraining or external knowledge.
Models trained with reasoning-oriented distillation show degraded citation recall compared with base models.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same neuron-location technique could be tested on other factual error types such as invented facts or incorrect dates to check for modularity.
Targeted editing of these neurons might allow domain-specific hallucination reduction without affecting overall model capability.
Whether these neurons overlap with circuits handling general factuality would clarify if citation errors are a distinct internal mechanism.

Load-bearing premise

The neurons chosen by the regularization procedure are the actual cause of the field-specific hallucinations rather than simply correlated with them or with other aspects of text generation.

What would settle it

A replication in which suppressing the selected neurons produces no reduction in citation hallucination rates, or amplifying them produces no increase, would break the claimed causal connection.

Figures

Figures reproduced from arXiv: 2604.18880 by Ruixiang Tang, Xiaodong Lin, Yihao Quan, Yuefei Chen.

**Figure 2.** Per-field citation accuracy at N = 15 across all models. Author accuracy is consistently the lowest across all models, while DOI accuracy is notably higher for Qwen2.5-32B-Instruct than for all other models.

2.2 Verification Pipeline: We first verify each generated reference against OpenAlex [17] through its public REST API. We use a two-stage lookup procedure. If a DOI is present, we query OpenAlex directl…

**Figure 4.** Per-field accuracy under model-level comparisons at N=15. (a) Qwen2.5-32B-Instruct (dense, 32B) leads on all fields. (b) Instruction tuning produces negligible differences across all fields.

… pattern suggests that chain-of-thought distillation prioritizes reasoning structure at the expense of factual memorization. 3 Probing for Field-Level Hallucination: The preceding analysis establishes that citation hall…

**Figure 5.** Probe AUC across transformer layers for each bibliographic field in Qwen2.5-32B-Instruct.

**Figure 6.** Probing pipeline for citation hallucination detection.

**Figure 7.** Figure 7 presents a 5×5 cross-field AUC heatmap, where entry (i, j) reports the AUC of a probe trained on field i and evaluated on the test set of field j. Diagonal entries (in-field performance) range from 0.812 to 0.922, confirming that hallucination is reliably decodable within each field. Off-diagonal entries, however, remain near chance (0.46–0.59), indicating that a probe trained on one field’s hallucination…

**Figure 8.** Hallucination rate by citation position, …

**Figure 9.** Cross-field AUC heatmap on MistralSmall-24B-Instruct-2501. In-field performance remains strong, but off-diagonal transfer is higher than in Qwen2.5-32B-Instruct, indicating weaker field-specific separation. We repeated the cross-field probe transfer anal…

**Figure 10.** Per-topic accuracy on Qwen2.5-32B-Instruct across the full 50-topic set, shown for the first …
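The quantity plotted in Figure 8, hallucination rate as a function of citation position, is a grouped proportion. A small sketch, assuming each generated reference is recorded as a `(position, is_hallucinated)` pair; the record layout and the toy values are hypothetical and are not data from the paper.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def rate_by_position(records: Iterable[Tuple[int, bool]]) -> Dict[int, float]:
    """Fraction of hallucinated references at each 1-based list position."""
    totals: Dict[int, int] = defaultdict(int)
    bad: Dict[int, int] = defaultdict(int)
    for position, is_hallucinated in records:
        totals[position] += 1
        bad[position] += int(is_hallucinated)
    return {pos: bad[pos] / totals[pos] for pos in sorted(totals)}


# Made-up records purely to exercise the function.
toy_records = [(1, False), (1, False), (2, False), (2, True), (3, True), (3, True)]
rates = rate_by_position(toy_records)
```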

Original abstract

LLMs frequently generate fictitious yet convincing citations, often expressing high confidence even when the underlying reference is wrong. We study this failure across 9 models and 108,000 generated references, and find that author names fail far more often than other fields across all models and settings. Citation style has no measurable effect, while reasoning-oriented distillation degrades recall. Probes trained on one field transfer at near-chance levels to the others, suggesting that hallucination signals do not generalize across fields. Building on this finding, we apply elastic-net regularization with stability selection to neuron-level CETT values of Qwen2.5-32B-Instruct and identify a sparse set of field-specific hallucination neurons (FH-neurons). Causal intervention further confirms their role: amplifying these neurons increases hallucination, while suppressing them improves performance across fields, with larger gains in some fields. These results suggest a lightweight approach to detecting and mitigating citation hallucination using internal model signals alone.
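The probe-transfer result can be read as a matrix of AUCs in which entry (i, j) is a probe trained on field i scored on the test set of field j. The sketch below shows only that bookkeeping, assuming probes are plain scoring functions and using a pairwise-comparison AUC; the two field names, the two-feature examples, and the toy probes are hypothetical stand-ins for trained probes on model activations.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Example = Tuple[float, float]


def auc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Probability that a positive example outscores a negative one (ties count 0.5)."""
    pos = [s for s, lab in zip(scores, labels) if lab == 1]
    neg = [s for s, lab in zip(scores, labels) if lab == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def cross_field_auc(
    probes: Dict[str, Callable[[Example], float]],
    test_sets: Dict[str, Tuple[List[Example], List[int]]],
) -> Dict[Tuple[str, str], float]:
    """Map (train_field, test_field) to the AUC of that probe on that test set."""
    return {
        (i, j): auc([probe(x) for x in xs], ys)
        for i, probe in probes.items()
        for j, (xs, ys) in test_sets.items()
    }


# Toy setup: each probe reads a different feature, so it only works in-field.
probes = {"author": lambda x: x[0], "title": lambda x: x[1]}
test_sets = {
    "author": ([(0.9, 0.5), (0.8, 0.5), (0.2, 0.5), (0.1, 0.5)], [1, 1, 0, 0]),
    "title": ([(0.5, 0.7), (0.5, 0.9), (0.5, 0.3), (0.5, 0.1)], [1, 1, 0, 0]),
}
heatmap = cross_field_auc(probes, test_sets)
```

Diagonal keys such as `("author", "author")` give in-field performance; off-diagonal keys give transfer, where a value near 0.5 means the probe carries no usable signal for the other field.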

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

The paper isolates a sparse set of field-specific hallucination neurons in Qwen2.5-32B via CETT and elastic-net selection, then shows directional causal effects from amplification and suppression, but lacks controls to confirm those neurons are more specific than any sparse perturbation.

This paper locates a small group of neurons tied to hallucinating citations in particular fields inside one 32B model and demonstrates that turning them up or down moves the hallucination rate in the expected direction. The authors test the basic pattern across nine models and 108k generated references, which is enough to make the author-name failure rate and the lack of probe transfer across fields look reliable. The non-transfer result is useful because it suggests the internal signals really are field-local rather than some generic detection failure. The causal interventions on the selected neurons are the clearest step beyond correlation. The selection itself uses elastic-net plus stability selection on CETT values, which is a reasonable way to get a sparse set without obvious circularity. The work stays grounded in external benchmarks and actual generation changes rather than reducing to fitted parameters alone.

The soft spot is the missing control experiments. Without showing what happens when you amplify or suppress a matched number of random neurons or high-CETT neurons that the selection procedure rejected, it is hard to know whether the observed changes are specific to hallucination circuitry or just any sparse edit to the generation path. The stress-test note is on target here. Post-selection on the same data also leaves some room for overfitting even with stability selection.

The paper is aimed at people doing mechanistic interpretability on LLMs or trying to build lightweight detectors for citation errors in generated text. A reader who cares about neuron-level handles on real failure modes will get concrete patterns and intervention results to think about. It deserves a serious referee because the scale is decent, the directional effects are reported, and the claims are falsifiable with the right controls. I would send it for review but ask the authors to add the random-neuron and non-selected-neuron intervention baselines before acceptance.
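The random-neuron baseline requested in the letter amounts to drawing control sets of the same size as the selected set from the remaining neurons. A minimal sketch, assuming neurons are indexed `0..n_neurons-1` and that matching is by count only; the function name, the indices, and the number of control sets are hypothetical.

```python
import random
from typing import List, Sequence


def matched_random_sets(n_neurons: int, selected: Sequence[int],
                        n_sets: int = 5, seed: int = 0) -> List[List[int]]:
    """Draw `n_sets` random neuron sets, each the same size as `selected`
    and disjoint from it, to serve as intervention controls."""
    rng = random.Random(seed)
    excluded = set(selected)
    pool = [i for i in range(n_neurons) if i not in excluded]
    if len(pool) < len(excluded):
        raise ValueError("not enough unselected neurons to build a matched set")
    return [sorted(rng.sample(pool, len(excluded))) for _ in range(n_sets)]


# Hypothetical selected neuron indices in a layer of 100 neurons.
fh_neurons = [3, 17, 42]
controls = matched_random_sets(100, fh_neurons, n_sets=4)
```

Each control set would then go through the same amplification and suppression protocol as the selected neurons. A stricter variant would also match layer or activation magnitude, which this sketch does not attempt.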

Referee Report

3 major / 2 minor

Summary. The paper studies citation hallucinations in LLMs across 9 models and 108,000 generated references, finding that author names fail at higher rates than other fields regardless of citation style. Reasoning-oriented distillation reduces recall, and field-specific probes transfer at near-chance levels, indicating non-generalizable hallucination signals. In Qwen2.5-32B-Instruct, elastic-net regularization with stability selection on neuron-level CETT values identifies sparse field-specific hallucination neurons (FH-neurons). Causal interventions show that amplifying these neurons increases hallucination rates while suppressing them improves citation performance across fields.

Significance. If the causal specificity of the FH-neurons holds after appropriate controls, the work provides a mechanistic, neuron-level account of field-specific citation hallucination and a lightweight internal-signal approach to detection and mitigation. The scale of the empirical evaluation (9 models, 108k references) and the use of causal interventions rather than purely correlational analysis are strengths that could inform targeted model editing techniques for knowledge-intensive generation tasks.

major comments (3)

[Causal intervention experiments] The causal intervention results (amplification increases hallucination; suppression improves performance) lack controls comparing effects to randomly sampled neuron sets or to high-CETT neurons excluded by stability selection. Without these baselines, the directional changes cannot be attributed specifically to FH-neurons rather than any sparse perturbation of generation circuitry.
[Neuron selection and FH-neuron identification] The elastic-net regularization and stability selection procedure for identifying FH-neurons (applied to CETT values) involves free hyperparameters whose sensitivity is not reported; combined with post-selection on the same data used for evaluation, this creates a risk that the sparse set is overfit to observed hallucination patterns rather than causally responsible.
[Results on interventions and probe transfer] The abstract and results sections provide no error bars, confidence intervals, or exact intervention protocols (e.g., scaling factors, number of neurons intervened, layer locations) for the reported hallucination rate changes; this weakens the ability to assess the magnitude and reliability of the claimed improvements.

minor comments (2)

[Methods] Clarify the exact definition and computation of CETT values in the methods section, including any preprocessing or normalization steps.
[Observational results] The claim that 'citation style has no measurable effect' should be supported by explicit statistical tests or effect sizes rather than a qualitative statement.
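One way to back the citation-style claim with an effect size is a chi-square statistic and Cramér's V on a style-by-outcome contingency table. The sketch below assumes rows are citation styles and columns are counts of correct and hallucinated references; the counts are invented for illustration and are not results from the paper.

```python
import math
from typing import List


def chi_square(table: List[List[int]]) -> float:
    """Pearson chi-square statistic for a contingency table of counts."""
    row_tot = [sum(row) for row in table]
    col_tot = [sum(col) for col in zip(*table)]
    n = sum(row_tot)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_tot[i] * col_tot[j] / n
            stat += (observed - expected) ** 2 / expected
    return stat


def cramers_v(table: List[List[int]]) -> float:
    """Effect size in [0, 1]: 0 means no association, 1 means perfect association."""
    n = sum(sum(row) for row in table)
    k = min(len(table), len(table[0])) - 1
    return math.sqrt(chi_square(table) / (n * k))


# Hypothetical counts: three styles with identical outcome rates, then an extreme case.
no_effect = [[80, 20], [80, 20], [80, 20]]
strong_effect = [[100, 0], [0, 100]]
v_none = cramers_v(no_effect)
v_strong = cramers_v(strong_effect)
```

A value of V close to zero, reported with the statistic and sample size, would support the "no measurable effect" wording more directly than a qualitative statement.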

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their thoughtful and constructive comments on our manuscript. We address each major comment point by point below, indicating the revisions we will make to strengthen the paper.

Referee: [Causal intervention experiments] The causal intervention results (amplification increases hallucination; suppression improves performance) lack controls comparing effects to randomly sampled neuron sets or to high-CETT neurons excluded by stability selection. Without these baselines, the directional changes cannot be attributed specifically to FH-neurons rather than any sparse perturbation of generation circuitry.

Authors: We agree that additional control experiments are required to establish the specificity of the FH-neurons. In the revised manuscript, we will add causal interventions on randomly sampled neuron sets matched for sparsity and on high-CETT neurons excluded by stability selection, using identical amplification and suppression protocols. These baselines will be reported alongside the original results to demonstrate that the observed changes are attributable to the selected FH-neurons rather than generic sparse perturbations.

revision: yes

Referee: [Neuron selection and FH-neuron identification] The elastic-net regularization and stability selection procedure for identifying FH-neurons (applied to CETT values) involves free hyperparameters whose sensitivity is not reported; combined with post-selection on the same data used for evaluation, this creates a risk that the sparse set is overfit to observed hallucination patterns rather than causally responsible.

Authors: We acknowledge the need for greater transparency on hyperparameters and data usage. The revised methods section will fully document all hyperparameters for elastic-net regularization and stability selection, including their specific values and selection rationale. We will also include a sensitivity analysis showing the stability of the identified neuron set across a range of hyperparameter values. To address post-selection concerns, we will explicitly document the data partitioning procedure used for CETT computation and subsequent interventions, clarifying any separation between selection and evaluation data.

revision: yes

Referee: [Results on interventions and probe transfer] The abstract and results sections provide no error bars, confidence intervals, or exact intervention protocols (e.g., scaling factors, number of neurons intervened, layer locations) for the reported hallucination rate changes; this weakens the ability to assess the magnitude and reliability of the claimed improvements.

Authors: We regret these omissions from the main text. In the revision, we will add error bars (standard errors across runs) and 95% confidence intervals to all reported hallucination rates in the results section and figures. We will also expand the methods section with complete intervention protocols, including exact scaling factors, number of neurons per intervention, layer locations, and other parameters. These details will be moved from the appendix into the main body for clarity and reproducibility.

revision: yes
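The promised error bars can be computed from per-run hallucination rates. A short sketch, assuming independent runs and a normal-approximation 95% interval (mean ± 1.96 standard errors); the run values are hypothetical, and with only a handful of runs a t-based interval would be wider.

```python
import math
from statistics import mean, stdev
from typing import Sequence, Tuple


def summarize_runs(rates: Sequence[float]) -> Tuple[float, float, Tuple[float, float]]:
    """Return (mean, standard error, 95% normal-approximation interval)."""
    m = mean(rates)
    se = stdev(rates) / math.sqrt(len(rates))  # sample standard deviation / sqrt(runs)
    half_width = 1.96 * se
    return m, se, (m - half_width, m + half_width)


# Hypothetical hallucination rates from five runs of one intervention setting.
run_rates = [0.30, 0.34, 0.32, 0.36, 0.28]
avg, se, (ci_low, ci_high) = summarize_runs(run_rates)
```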

Circularity Check

0 steps flagged

No significant circularity detected

The paper identifies FH-neurons via elastic-net and stability selection on CETT values computed from generated references, then validates via causal interventions (amplification/suppression) that alter hallucination rates. These steps depend on external benchmarks (108k generated references across models) and direct interventions rather than any equation or selection procedure that reduces the outcome to the input data by construction. No self-citations are load-bearing for the core claim, no ansatz is smuggled, and no renaming of known results occurs. The derivation chain remains self-contained against the experimental data.

Axiom & Free-Parameter Ledger

1 free parameter · 2 axioms · 0 invented entities

The central claim rests on standard assumptions from mechanistic interpretability (neurons can be causally intervened without major side effects) and statistical feature selection (elastic-net plus stability selection isolates relevant units). No new physical entities are postulated.

free parameters (1)

elastic-net regularization strength and stability selection threshold
Hyperparameters chosen to produce a sparse set of FH-neurons; their exact values are not stated in the abstract but directly determine which neurons are labeled causal.

axioms (2)

domain assumption: Causal interventions on individual neurons produce interpretable changes in output behavior without compensatory effects from the rest of the network.
Invoked when claiming that amplifying or suppressing the selected neurons directly controls hallucination rates.
domain assumption: Probe transfer failure across fields indicates genuinely field-specific mechanisms rather than insufficient probe capacity.
Used to interpret the near-chance transfer results as evidence against a single general hallucination signal.

Reference graph

Works this paper leans on

50 extracted references · 50 canonical work pages

[1] Causal inference and treatment effect estimation
[2] Bayesian optimization and probabilistic modeling
[3] Reinforcement learning and policy optimization
[4] Robustness and adversarial machine learning
[5] Explainable AI and interpretability methods
[6] Representation learning and self-supervised learning
[7] Graph neural networks and graph learning
[8] Federated learning and distributed ML
[9] Fairness and bias mitigation in ML
[10] ML evaluation, benchmarks, and reproducibility
[11] Information extraction and relation extraction
[12] Question answering and retrieval-augmented generation
[13] Machine translation and multilingual NLP
[14] Summarization and factual consistency
[15] Dialogue systems and conversational agents
[16] Prompting and instruction tuning methods
[17] NLP robustness, safety, and alignment
[18] Text generation evaluation and metrics
[19] Knowledge grounding and entity linking
[20] Long-context modeling and efficient attention
[21] Image classification and representation learning
[22] Object detection and instance segmentation
[23] Vision transformers and efficient vision models
[24] 3D vision and point cloud understanding
[25] Visual question answering and vision-language models
[26] Video understanding and temporal action recognition
[27] Generative vision models and diffusion
[28] Self-supervised learning for vision
[29] Vision robustness and adversarial attacks
[30] Medical imaging and computer-aided diagnosis
[31] Distributed systems and consensus protocols
[32] Cloud computing and serverless architectures
[33] Storage systems and key-value stores
[34] Operating systems scheduling and resource management
[35] Compilers and code optimization
[36] Systems performance modeling and profiling
[37] Datacenter networking and traffic engineering
[38] GPU systems and ML systems optimization
[39] Fault tolerance and reliability engineering
[40] Observability and telemetry systems
[41] Network security and intrusion detection
[42] Cryptography and secure computation
[43] Privacy-preserving data analysis and differential privacy
[44] Malware analysis and reverse engineering
[45] Web security and authentication protocols
[46] Database query optimization and indexing
[47] Data integration and entity resolution
[48] Human-computer interaction and usability studies
[49] Program analysis and static/dynamic analysis
[50] Approximation algorithms and randomized algorithms

D Output Volume Analysis: To examine whether citation order within a prompt affects reliability, we analyze hallucination rate as a function of citation position across all models and generation volumes. As shown in Figure 8, the first two citations exhibit notably lower hallucination rates, after which performance rapidly degrades and plateaus from position 3 onward.
