arxiv: 2604.18880 · v1 · submitted 2026-04-20 · 💻 cs.CL · cs.AI

Where Fake Citations Are Made: Tracing Field-Level Hallucination to Specific Neurons in LLMs

Pith reviewed 2026-05-10 04:18 UTC · model grok-4.3

classification 💻 cs.CL cs.AI
keywords LLM hallucination, citation generation, neuron intervention, field-specific errors, causal analysis, elastic net regularization, internal model signals

The pith

LLMs contain sparse field-specific neurons that causally drive citation hallucinations, which can be amplified to increase errors or suppressed to reduce them.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper studies how large language models fabricate citations, tracking errors across nine models and over 100,000 generated references. Author names prove far more unreliable than other fields such as titles or venues, and this pattern holds regardless of citation format. Probes trained to detect hallucinations in one field perform near chance on others, indicating the underlying signals are field-specific rather than general. In one model, Qwen2.5-32B-Instruct, elastic-net regularization with stability selection on neuron-level CETT values isolates a small set of neurons tied to these failures. Direct intervention on those neurons raises hallucination rates when their activity is increased and lowers rates when it is decreased, which the paper takes as evidence of a causal role.

Core claim

A sparse set of field-specific hallucination neurons (FH-neurons) is identified in Qwen2.5-32B-Instruct via elastic-net regularization and stability selection applied to neuron CETT values. These neurons are shown to be causally linked to citation errors because amplifying their activations increases hallucination rates while suppressing them improves accuracy across citation fields.
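The selection step named in the core claim can be illustrated with a minimal sketch. It is not the authors' pipeline: it assumes a linear elastic-net fit by coordinate descent, synthetic per-neuron features standing in for CETT values, a continuous toy target standing in for the hallucination label, and hypothetical hyperparameters (`lam`, `alpha`, `threshold`), since this page does not report the paper's values.

```python
import random

def soft_threshold(value, threshold):
    """Soft-thresholding operator that implements the L1 part of the penalty."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0

def fit_elastic_net(X, y, lam=0.1, alpha=0.5, n_sweeps=30):
    """Coordinate descent for
    (1/2n)*||y - Xw||^2 + lam*(alpha*||w||_1 + (1-alpha)/2*||w||_2^2).
    No intercept is fitted: inputs are assumed to be roughly centered."""
    n, p = len(X), len(X[0])
    w = [0.0] * p
    residual = list(y)  # equals y - Xw while w is all zeros
    col_sq = [sum(X[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    for _ in range(n_sweeps):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue
            # Correlation of feature j with the residual that excludes feature j.
            rho = sum(X[i][j] * residual[i] for i in range(n)) / n + col_sq[j] * w[j]
            new_w = soft_threshold(rho, lam * alpha) / (col_sq[j] + lam * (1.0 - alpha))
            delta = new_w - w[j]
            if delta != 0.0:
                for i in range(n):
                    residual[i] -= delta * X[i][j]
                w[j] = new_w
    return w

def stability_selection(X, y, n_rounds=30, threshold=0.8, seed=0, **fit_kwargs):
    """Refit on random half-samples and keep features whose coefficient is
    nonzero in at least `threshold` of the rounds."""
    rng = random.Random(seed)
    n, p = len(X), len(X[0])
    counts = [0] * p
    for _ in range(n_rounds):
        idx = rng.sample(range(n), n // 2)
        w = fit_elastic_net([X[i] for i in idx], [y[i] for i in idx], **fit_kwargs)
        for j in range(p):
            if w[j] != 0.0:
                counts[j] += 1
    freq = [c / n_rounds for c in counts]
    return [j for j in range(p) if freq[j] >= threshold], freq

# Synthetic data: 20 hypothetical "neurons", of which only 3 and 11 carry signal.
rng = random.Random(0)
X = [[rng.gauss(0.0, 1.0) for _ in range(20)] for _ in range(120)]
y = [2.0 * row[3] - 1.5 * row[11] + rng.gauss(0.0, 0.5) for row in X]

selected, freq = stability_selection(
    X, y, n_rounds=30, threshold=0.8, lam=0.4, alpha=0.9, n_sweeps=30
)
print("stable features:", selected)
```

The half-sample refits are what make the selection "stable": a feature that enters the model only on some subsamples is dropped, which is also why the regularization strength and frequency threshold directly shape the final neuron set.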

What carries the argument

Field-specific hallucination neurons (FH-neurons), a sparse subset of neurons selected by elastic-net regularization with stability selection on CETT values, which causally modulate the rate of fabricated citations when their activations are scaled up or down.
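Scaling activations up or down, as described above, can be sketched on a toy network. Everything here is an assumption for illustration: the one-hidden-layer model, its weights, the choice of neuron 2 as the "selected" neuron, and the scaling factors 0.0 and 2.0. The paper's actual protocol (factors, layers, neuron counts) is not given on this page.

```python
def relu(value):
    return value if value > 0.0 else 0.0

def forward(x, w_in, w_out, scale=None):
    """Toy one-hidden-layer network whose scalar output plays the role of a
    hallucination logit. `scale` maps a hidden-neuron index to a multiplier
    applied to that neuron's activation (missing index means untouched)."""
    scale = scale or {}
    hidden = []
    for j, row in enumerate(w_in):
        activation = relu(sum(w * xi for w, xi in zip(row, x)))
        hidden.append(activation * scale.get(j, 1.0))
    return sum(w * h for w, h in zip(w_out, hidden))

# Hypothetical weights: neuron 2 has the largest positive output weight.
W_IN = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
W_OUT = [1.0, -0.5, 2.0]
x = [1.0, 2.0]
fh_neurons = [2]  # stand-in for a selected neuron set

baseline = forward(x, W_IN, W_OUT)
suppressed = forward(x, W_IN, W_OUT, {j: 0.0 for j in fh_neurons})
amplified = forward(x, W_IN, W_OUT, {j: 2.0 for j in fh_neurons})
print(baseline, suppressed, amplified)  # 3.0 0.0 6.0
```

In a real transformer the same multiplier would be applied to specific MLP activations during generation, and the quantity compared would be the measured hallucination rate rather than a single logit.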

If this is right

  • Author names are the most error-prone field in LLM-generated citations across models and styles.
  • Hallucination detection signals learned for one citation field transfer poorly to other fields.
  • Suppressing the identified neurons improves citation accuracy in multiple fields without requiring retraining or external knowledge.
  • Models trained with reasoning-oriented distillation show degraded citation recall compared with base models.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same neuron-location technique could be tested on other factual error types such as invented facts or incorrect dates to check for modularity.
  • Targeted editing of these neurons might allow domain-specific hallucination reduction without affecting overall model capability.
  • Whether these neurons overlap with circuits handling general factuality would clarify if citation errors are a distinct internal mechanism.

Load-bearing premise

The neurons chosen by the regularization procedure are the actual cause of the field-specific hallucinations rather than simply correlated with them or with other aspects of text generation.

What would settle it

A replication in which suppressing the selected neurons produces no reduction in citation hallucination rates, or amplifying them produces no increase, would break the claimed causal connection.

Figures

Figures reproduced from arXiv: 2604.18880 by Ruixiang Tang, Xiaodong Lin, Yihao Quan, Yuefei Chen.

Figure 1: Overview of citation hallucination in LLMs. Given a topic, models generate references …
Figure 2: Per-field citation accuracy at N = 15 across all models. Author accuracy is consistently the lowest across all models, while DOI accuracy is notably higher for Qwen2.5-32B-Instruct than for all other models.
Surrounding text, Section 2.2 (Verification Pipeline): We first verify each generated reference against OpenAlex [17] through its public REST API. We use a two-stage lookup procedure. If a DOI is present, we query OpenAlex directl…
Figure 4: Per-field accuracy under model-level comparisons at N = 15. (a) Qwen2.5-32B-Instruct (dense, 32B) leads on all fields. (b) Instruction tuning produces negligible differences across all fields.
Surrounding text: … pattern suggests that chain-of-thought distillation prioritizes reasoning structure at the expense of factual memorization.
Surrounding text, Section 3 (Probing for Field-Level Hallucination): The preceding analysis establishes that citation hall…
Figure 5: Probe AUC across transformer layers for each bibliographic field in Qwen2.5-32B-Instruct.
Figure 6: Probing pipeline for citation hallucination detection.
Figure 7: 5×5 cross-field AUC heatmap.
Surrounding text: Figure 7 presents a 5×5 cross-field AUC heatmap, where entry (i, j) reports the AUC of a probe trained on field i and evaluated on the test set of field j. Diagonal entries (in-field performance) range from 0.812 to 0.922, confirming that hallucination is reliably decodable within each field. Off-diagonal entries, however, remain near chance (0.46–0.59), indicating that a probe trained on one field's hallucination…
Figure 8: Hallucination rate by citation position, …
Figure 9: Cross-field AUC heatmap on Mistral-Small-24B-Instruct-2501. In-field performance remains strong, but off-diagonal transfer is higher than in Qwen2.5-32B-Instruct, indicating weaker field-specific separation.
Surrounding text: We repeated the cross-field probe transfer anal…
Figure 10: Per-topic accuracy on Qwen2.5-32B-Instruct across the full 50-topic set, shown for the first …
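The cross-field transfer matrix described for Figure 7, where entry (i, j) is the AUC of a probe trained on field i and evaluated on field j, can be sketched as follows. This is an illustration under stated assumptions, not the paper's probing pipeline: the features are synthetic, each field's signal is placed in its own dimension by construction, and the "probe" is a simple difference-of-class-means direction rather than whatever classifier the authors trained.

```python
import random

def roc_auc(labels, scores):
    """Rank-based AUC (Mann-Whitney U), with average ranks for tied scores."""
    pairs = sorted(zip(scores, labels))
    n = len(pairs)
    rank_sum_pos = 0.0
    i = 0
    while i < n:
        j = i
        while j < n and pairs[j][0] == pairs[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2.0  # mean of the 1-based ranks i+1 .. j
        rank_sum_pos += avg_rank * sum(1 for k in range(i, j) if pairs[k][1] == 1)
        i = j
    n_pos = sum(labels)
    n_neg = n - n_pos
    return (rank_sum_pos - n_pos * (n_pos + 1) / 2.0) / (n_pos * n_neg)

def make_field(signal_dim, n, rng, dims=4, shift=3.0):
    """Synthetic activations: label 1 (hallucinated) shifts one dimension only."""
    X, y = [], []
    for i in range(n):
        label = i % 2
        x = [rng.gauss(0.0, 1.0) for _ in range(dims)]
        x[signal_dim] += shift * label
        X.append(x)
        y.append(label)
    return X, y

def train_probe(X, y):
    """Difference of class means, used here as a minimal linear probe."""
    pos = [x for x, label in zip(X, y) if label == 1]
    neg = [x for x, label in zip(X, y) if label == 0]
    return [
        sum(x[d] for x in pos) / len(pos) - sum(x[d] for x in neg) / len(neg)
        for d in range(len(X[0]))
    ]

def score(probe, X):
    return [sum(w * xi for w, xi in zip(probe, x)) for x in X]

rng = random.Random(0)
fields = {"author": 0, "title": 1, "venue": 2}  # toy subset of fields
train = {f: make_field(d, 400, rng) for f, d in fields.items()}
test = {f: make_field(d, 400, rng) for f, d in fields.items()}
probes = {f: train_probe(*train[f]) for f in fields}

auc = {
    (src, dst): roc_auc(test[dst][1], score(probes[src], test[dst][0]))
    for src in fields
    for dst in fields
}
for src in fields:
    print(src, [round(auc[(src, dst)], 2) for dst in fields])
```

Because the toy signals are orthogonal by design, the diagonal is high and the off-diagonal sits near 0.5. Near-chance transfer in real activations is consistent with field-specific signals, but as the axiom ledger notes, it could also reflect limited probe capacity.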
Original abstract

LLMs frequently generate fictitious yet convincing citations, often expressing high confidence even when the underlying reference is wrong. We study this failure across 9 models and 108,000 generated references, and find that author names fail far more often than other fields across all models and settings. Citation style has no measurable effect, while reasoning-oriented distillation degrades recall. Probes trained on one field transfer at near-chance levels to the others, suggesting that hallucination signals do not generalize across fields. Building on this finding, we apply elastic-net regularization with stability selection to neuron-level CETT values of Qwen2.5-32B-Instruct and identify a sparse set of field-specific hallucination neurons (FH-neurons). Causal intervention further confirms their role: amplifying these neurons increases hallucination, while suppressing them improves performance across fields, with larger gains in some fields. These results suggest a lightweight approach to detecting and mitigating citation hallucination using internal model signals alone.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper studies citation hallucinations in LLMs across 9 models and 108,000 generated references, finding that author names fail at higher rates than other fields regardless of citation style. Reasoning-oriented distillation reduces recall, and field-specific probes transfer at near-chance levels, indicating non-generalizable hallucination signals. In Qwen2.5-32B-Instruct, elastic-net regularization with stability selection on neuron-level CETT values identifies sparse field-specific hallucination neurons (FH-neurons). Causal interventions show that amplifying these neurons increases hallucination rates while suppressing them improves citation performance across fields.

Significance. If the causal specificity of the FH-neurons holds after appropriate controls, the work provides a mechanistic, neuron-level account of field-specific citation hallucination and a lightweight internal-signal approach to detection and mitigation. The scale of the empirical evaluation (9 models, 108k references) and the use of causal interventions rather than purely correlational analysis are strengths that could inform targeted model editing techniques for knowledge-intensive generation tasks.

major comments (3)
  1. [Causal intervention experiments] The causal intervention results (amplification increases hallucination; suppression improves performance) lack controls comparing effects to randomly sampled neuron sets or to high-CETT neurons excluded by stability selection. Without these baselines, the directional changes cannot be attributed specifically to FH-neurons rather than any sparse perturbation of generation circuitry.
  2. [Neuron selection and FH-neuron identification] The elastic-net regularization and stability selection procedure for identifying FH-neurons (applied to CETT values) involves free hyperparameters whose sensitivity is not reported; combined with post-selection on the same data used for evaluation, this creates a risk that the sparse set is overfit to observed hallucination patterns rather than causally responsible.
  3. [Results on interventions and probe transfer] The abstract and results sections provide no error bars, confidence intervals, or exact intervention protocols (e.g., scaling factors, number of neurons intervened, layer locations) for the reported hallucination rate changes; this weakens the ability to assess the magnitude and reliability of the claimed improvements.
minor comments (2)
  1. [Methods] Clarify the exact definition and computation of CETT values in the methods section, including any preprocessing or normalization steps.
  2. [Observational results] The claim that 'citation style has no measurable effect' should be supported by explicit statistical tests or effect sizes rather than a qualitative statement.
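The random-neuron control requested in major comment 1 can be sketched as an empirical null distribution: measure the suppression effect of many random neuron sets of the same size and ask how often they match the selected set. The per-neuron contributions below are invented toy numbers and the effect is a simple sum. In a real experiment each effect would be a measured change in hallucination rate.

```python
import random

def suppression_effect(contributions, neuron_set):
    """Drop in a toy hallucination logit when the given neurons are zeroed.
    Each neuron's contribution is assumed additive, purely for illustration."""
    return sum(contributions[j] for j in neuron_set)

# Hypothetical contributions: 5 strong neurons among 50, the rest near zero.
contributions = [1.0] * 5 + [0.01 * ((j % 7) - 3) for j in range(5, 50)]
fh_neurons = [0, 1, 2, 3, 4]  # stand-in for the selected set

observed = suppression_effect(contributions, fh_neurons)

# Null distribution: random sets matched in size to the selected set.
rng = random.Random(0)
n_controls = 1000
control_effects = [
    suppression_effect(
        contributions, rng.sample(range(len(contributions)), len(fh_neurons))
    )
    for _ in range(n_controls)
]

# Empirical one-sided p-value with the usual +1 correction.
exceed = sum(1 for effect in control_effects if effect >= observed)
p_value = (1 + exceed) / (1 + n_controls)
print(f"observed={observed:.2f}  max control={max(control_effects):.2f}  p={p_value:.4f}")
```

A second control along the same lines would draw the comparison sets from high-CETT neurons that stability selection rejected, which separates "selected by the procedure" from "merely high-scoring".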

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their thoughtful and constructive comments on our manuscript. We address each major comment point by point below, indicating the revisions we will make to strengthen the paper.

Point-by-point responses
  1. Referee: [Causal intervention experiments] The causal intervention results (amplification increases hallucination; suppression improves performance) lack controls comparing effects to randomly sampled neuron sets or to high-CETT neurons excluded by stability selection. Without these baselines, the directional changes cannot be attributed specifically to FH-neurons rather than any sparse perturbation of generation circuitry.

    Authors: We agree that additional control experiments are required to establish the specificity of the FH-neurons. In the revised manuscript, we will add causal interventions on randomly sampled neuron sets matched for sparsity and on high-CETT neurons excluded by stability selection, using identical amplification and suppression protocols. These baselines will be reported alongside the original results to demonstrate that the observed changes are attributable to the selected FH-neurons rather than generic sparse perturbations. revision: yes

  2. Referee: [Neuron selection and FH-neuron identification] The elastic-net regularization and stability selection procedure for identifying FH-neurons (applied to CETT values) involves free hyperparameters whose sensitivity is not reported; combined with post-selection on the same data used for evaluation, this creates a risk that the sparse set is overfit to observed hallucination patterns rather than causally responsible.

    Authors: We acknowledge the need for greater transparency on hyperparameters and data usage. The revised methods section will fully document all hyperparameters for elastic-net regularization and stability selection, including their specific values and selection rationale. We will also include a sensitivity analysis showing the stability of the identified neuron set across a range of hyperparameter values. To address post-selection concerns, we will explicitly document the data partitioning procedure used for CETT computation and subsequent interventions, clarifying any separation between selection and evaluation data. revision: yes

  3. Referee: [Results on interventions and probe transfer] The abstract and results sections provide no error bars, confidence intervals, or exact intervention protocols (e.g., scaling factors, number of neurons intervened, layer locations) for the reported hallucination rate changes; this weakens the ability to assess the magnitude and reliability of the claimed improvements.

    Authors: We regret these omissions from the main text. In the revision, we will add error bars (standard errors across runs) and 95% confidence intervals to all reported hallucination rates in the results section and figures. We will also expand the methods section with complete intervention protocols, including exact scaling factors, number of neurons per intervention, layer locations, and other parameters. These details will be moved from the appendix into the main body for clarity and reproducibility. revision: yes
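One way to attach a 95% confidence interval to a change in hallucination rate is a percentile bootstrap, sketched below. The outcome lists are invented (1 means a hallucinated citation), the two conditions are resampled independently, and nothing here reflects the paper's actual numbers or the interval method the authors will use.

```python
import random

def rate(outcomes):
    """Fraction of citations flagged as hallucinated (1) in a list of 0/1 outcomes."""
    return sum(outcomes) / len(outcomes)

def bootstrap_diff_ci(baseline, treated, n_boot=2000, level=0.95, seed=0):
    """Percentile bootstrap interval for rate(treated) - rate(baseline),
    resampling each condition independently with replacement."""
    rng = random.Random(seed)
    diffs = []
    for _ in range(n_boot):
        b = rng.choices(baseline, k=len(baseline))
        t = rng.choices(treated, k=len(treated))
        diffs.append(rate(t) - rate(b))
    diffs.sort()
    tail = (1.0 - level) / 2.0
    lo = diffs[int(tail * n_boot)]
    hi = diffs[int((1.0 - tail) * n_boot) - 1]
    return lo, hi

# Hypothetical outcomes: 60/100 hallucinated at baseline, 45/100 after suppression.
baseline_outcomes = [1] * 60 + [0] * 40
suppressed_outcomes = [1] * 45 + [0] * 55

point = rate(suppressed_outcomes) - rate(baseline_outcomes)
lo, hi = bootstrap_diff_ci(baseline_outcomes, suppressed_outcomes)
print(f"rate change: {point:+.3f}  (95% CI {lo:+.3f} to {hi:+.3f})")
```

If the same prompts are evaluated under both conditions, a paired resampling of prompt indices would be the more appropriate design, since it removes between-prompt variance from the interval.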

Circularity Check

0 steps flagged

No significant circularity detected

Full rationale

The paper identifies FH-neurons via elastic-net and stability selection on CETT values computed from generated references, then validates via causal interventions (amplification/suppression) that alter hallucination rates. These steps depend on external benchmarks (108k generated references across models) and direct interventions rather than any equation or selection procedure that reduces the outcome to the input data by construction. No self-citations are load-bearing for the core claim, no ansatz is smuggled, and no renaming of known results occurs. The derivation chain remains self-contained against the experimental data.

Axiom & Free-Parameter Ledger

1 free parameter · 2 axioms · 0 invented entities

The central claim rests on standard assumptions from mechanistic interpretability (neurons can be causally intervened without major side effects) and statistical feature selection (elastic-net plus stability selection isolates relevant units). No new physical entities are postulated.

free parameters (1)
  • elastic-net regularization strength and stability selection threshold
    Hyperparameters chosen to produce a sparse set of FH-neurons; their exact values are not stated in the abstract but directly determine which neurons are labeled causal.
axioms (2)
  • domain assumption Causal interventions on individual neurons produce interpretable changes in output behavior without compensatory effects from the rest of the network.
    Invoked when claiming that amplifying or suppressing the selected neurons directly controls hallucination rates.
  • domain assumption Probe transfer failure across fields indicates genuinely field-specific mechanisms rather than insufficient probe capacity.
    Used to interpret the near-chance transfer results as evidence against a single general hallucination signal.



Appendix D (Output Volume Analysis), excerpt: To examine whether citation order within a prompt affects reliability, we analyze hallucination rate as a function of citation position across all models and generation volumes. As shown in Figure 8, the first two citations exhibit notably lower hallucination rates, after which performance rapidly degrades and plateaus from position 3 onward.