Hallucination as Trajectory Commitment: Causal Evidence for Asymmetric Attractor Dynamics in Transformer Generation
Pith reviewed 2026-05-10 10:57 UTC · model grok-4.3
The pith
Hallucination in language models arises as an early commitment to a false generation trajectory governed by asymmetric attractor dynamics.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We present causal evidence that hallucination in autoregressive language models is an early trajectory commitment governed by asymmetric attractor dynamics. Using same-prompt bifurcation, in which we repeatedly sample identical inputs to observe spontaneous divergence, we isolate trajectory dynamics from prompt-level confounds. On Qwen2.5-1.5B across 61 prompts spanning six categories, 27 prompts bifurcate with factual and hallucinated trajectories diverging at the first generated token. Activation patching across layers reveals a pronounced causal asymmetry: injecting a hallucinated activation into a correct trajectory corrupts output far more often than the reverse recovers it, and window patching shows that correction requires sustained multi-step intervention, whereas corruption needs only a single perturbation.
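The bifurcation criterion the paper uses (KL = 0 at step 0, KL > 1.0 at step 1, per the full abstract below) reduces to a simple check over next-token distributions. A minimal illustrative sketch, not the paper's code; the threshold and tolerance values here are assumptions:

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) between two next-token probability distributions."""
    p = np.asarray(p, dtype=float) + eps
    q = np.asarray(q, dtype=float) + eps
    p, q = p / p.sum(), q / q.sum()
    return float(np.sum(p * np.log(p / q)))

def bifurcates(factual_dists, halluc_dists, threshold=1.0, tol=1e-6):
    """Flag a prompt as bifurcating when the two trajectory groups
    share their step-0 distribution (KL ~ 0) but diverge at step 1."""
    return (kl_divergence(factual_dists[0], halluc_dists[0]) < tol
            and kl_divergence(factual_dists[1], halluc_dists[1]) > threshold)

# Toy prompt: identical step-0 distributions, divergent step-1 distributions.
factual = [np.array([0.5, 0.5]), np.array([0.95, 0.05])]
halluc  = [np.array([0.5, 0.5]), np.array([0.05, 0.95])]
```

On this toy prompt `bifurcates(factual, halluc)` holds: the groups agree exactly at step 0 and the step-1 KL is well above 1.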
What carries the argument
Same-prompt bifurcation combined with targeted activation patching, which together expose asymmetric stability between correct and hallucinated generation paths.
If this is right
- Prompt encodings already visible at step zero can forecast which inputs will produce high hallucination rates.
- A single perturbation can push the model into a hallucinated basin, but sustained intervention across multiple steps is required to pull it out.
- The basins themselves cluster into a small number of regime-like groups identifiable from the prompt representation alone.
- Bifurcation happens at the first token for a substantial fraction of prompts, separating factual and false paths before further generation occurs.
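The entry/exit asymmetry these bullets describe has a familiar dynamical-systems analogue: a tilted double well, where falling into the deeper basin takes one kick but climbing out takes sustained forcing. The sketch below is purely illustrative; the potential, step size, and kick magnitudes are our choices, not anything from the paper:

```python
# Illustrative 1-D analogue of asymmetric basins: a tilted double well
# V(x) = x^4/4 - x^2/2 + 0.3x, with a shallow "factual" minimum near
# x ~ +0.79 and a deeper "hallucination" minimum near x ~ -1.13.
FACTUAL, HALLUC = 0.79, -1.13

def step(x, push=0.0, dt=0.1):
    grad = x**3 - x + 0.3          # V'(x)
    return x + dt * (-grad + push)

def settle(x, n=400):
    """Relax to the nearest minimum with no intervention."""
    for _ in range(n):
        x = step(x)
    return x

def run(x, kick, steps):
    """Apply a constant push for `steps` steps, then relax."""
    for _ in range(steps):
        x = step(x, push=kick)
    return settle(x)

corrupted = run(FACTUAL, kick=-6.0, steps=1)   # one-step kick falls into the deep well
failed    = run(HALLUC,  kick=+6.0, steps=1)   # equal one-step kick fails to escape
recovered = run(HALLUC,  kick=+1.0, steps=60)  # sustained push does escape
```

In this toy system a single large kick from the factual minimum ends in the hallucination basin, an equal kick in the opposite direction falls back, and only a sustained push carries the state over the barrier, mirroring the single-perturbation-corrupts / multi-step-corrects pattern claimed for the model.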
Where Pith is reading between the lines
- If the asymmetry persists in larger models, early steering of the residual stream could become a practical defense against hallucination.
- The same commitment mechanism may underlie other generation failures such as repetitive loops or sudden topic shifts.
- Training objectives that penalize rapid basin entry could flatten the attractors without changing model scale.
Load-bearing premise
The interventions used to swap activations between trajectories do not create artificial effects that differ from the model's ordinary dynamics, and the observed patterns hold beyond the tested model size and prompt set.
What would settle it
Finding that a single activation patch corrects a hallucinated trajectory as readily as it corrupts a correct one would falsify the claimed asymmetry.
Original abstract
We present causal evidence that hallucination in autoregressive language models is an early trajectory commitment governed by asymmetric attractor dynamics. Using same-prompt bifurcation, in which we repeatedly sample identical inputs to observe spontaneous divergence, we isolate trajectory dynamics from prompt-level confounds. On Qwen2.5-1.5B across 61 prompts spanning six categories, 27 prompts (44.3%) bifurcate with factual and hallucinated trajectories diverging at the first generated token (KL = 0 at step 0, KL > 1.0 at step 1). Activation patching across 28 layers reveals a pronounced causal asymmetry: injecting a hallucinated activation into a correct trajectory corrupts output in 87.5% of trials (layer 20), while the reverse recovers only 33.3% (layer 24); both exceed the 10.4% baseline (p = 0.025) and 12.5% random-patch control. Window patching shows correction requires sustained multi-step intervention, whereas corruption needs only a single perturbation. Probing the prompt encoding itself, step-0 residual states predict per-prompt hallucination rate at Pearson r = 0.776 at layer 15 (p < 0.001 against a 1000-permutation null); unsupervised clustering identifies five regime-like groups (eta^2 = 0.55) whose saddle-adjacent cluster concentrates 12 of the 13 bifurcating false-premise prompts, indicating that the basin structure is organized around regime commitments fixed at prompt encoding. These findings characterize hallucination as a locally stable attractor basin: entry is probabilistic and rapid, exit demands coordinated intervention across layers and steps, and the relevant basins are selected by clusterable regimes already discernible at step 0.
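The step-0 probing result (Pearson r = 0.776 against a 1000-permutation null) uses a standard permutation test for a correlation. A minimal sketch on synthetic data; the probe scores and hallucination rates below are fabricated stand-ins, not the paper's measurements:

```python
import numpy as np

def pearson_r(x, y):
    x, y = np.asarray(x, float), np.asarray(y, float)
    xc, yc = x - x.mean(), y - y.mean()
    return float((xc @ yc) / np.sqrt((xc @ xc) * (yc @ yc)))

def permutation_test(x, y, n_perm=1000, seed=0):
    """Observed Pearson r and a two-sided p-value against a
    shuffled-label null of n_perm permutations."""
    rng = np.random.default_rng(seed)
    r_obs = pearson_r(x, y)
    null = np.array([pearson_r(x, rng.permutation(y)) for _ in range(n_perm)])
    return r_obs, float(np.mean(np.abs(null) >= abs(r_obs)))

# Synthetic stand-ins: 61 step-0 probe scores and correlated hallucination rates.
rng = np.random.default_rng(1)
probe_score = rng.normal(size=61)
halluc_rate = 0.8 * probe_score + rng.normal(scale=0.5, size=61)
r, p = permutation_test(probe_score, halluc_rate)
```

With 61 prompts and a strong underlying correlation, the observed r sits far outside the shuffled null, so the permutation p-value is effectively zero, the same logic as the paper's p < 0.001 claim.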
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript claims that hallucinations in autoregressive transformers arise from early, probabilistic commitment to asymmetric attractor basins in the generation trajectory. Using same-prompt bifurcation on Qwen2.5-1.5B across 61 prompts, it reports that 27 prompts (44.3%) spontaneously diverge into factual versus hallucinated paths at the first generated token (KL>1.0). Activation patching across layers shows strong directional asymmetry (hallucinated-into-correct corrupts 87.5% at layer 20; the reverse recovers only 33.3% at layer 24, both exceeding the 10.4% baseline at p=0.025 and the 12.5% random-patch control), while window patching indicates that exit from the hallucinated basin requires sustained multi-step intervention. Prompt-level residual states at step 0 predict per-prompt hallucination rates (r=0.776 at layer 15) and cluster into regime-like groups whose saddle-adjacent cluster concentrates false-premise bifurcators.
Significance. If the causal claims hold, the work supplies one of the first interventional, dynamical-systems accounts of hallucination, distinguishing entry (rapid, probabilistic) from exit (coordinated, sustained) and linking both to prompt-encoding structure. The multi-method design (bifurcation + patching + predictive probing + clustering) with statistical controls is a strength; reproducible code or public data would further increase impact for detection and mitigation research.
major comments (3)
- [activation patching results] Activation patching experiments (described in the results on directional asymmetry): the observed 87.5% vs. 33.3% asymmetry could arise from differential sensitivity to out-of-distribution activations rather than stable attractor basins. Direct replacement of activations can disrupt attention keys/values and residual norms independently of natural dynamics; additional controls (e.g., patching with activations drawn from other correct trajectories or noise-matched interventions) are needed to isolate the claimed causal role of attractor structure.
- [window patching and bifurcation analysis] Bifurcation and window-patching sections: the claim that correction requires sustained multi-step intervention while corruption needs only a single step is load-bearing for the asymmetric-attractor interpretation, yet the manuscript does not report how layer and window sizes were chosen or whether they were pre-specified versus selected after observing the data. Post-hoc selection risks inflating the apparent asymmetry.
- [prompt encoding probe and unsupervised clustering] Prompt-encoding probe and clustering (layer 15 residual states, eta^2=0.55): while the Pearson r=0.776 and concentration of false-premise prompts in one cluster are suggestive, the analysis must demonstrate that the clusters are not confounded by surface features such as prompt length, lexical overlap, or category (the six categories are mentioned but not balanced in the clustering validation).
minor comments (2)
- [experimental setup] The 61-prompt corpus and its six categories should be described with explicit inclusion criteria and balance statistics; without this, it is difficult to assess selection bias in the 44.3% bifurcation rate.
- [figures and tables] Figure legends and table captions should explicitly state the exact statistical tests, permutation counts, and multiple-comparison corrections used for the p=0.025 and r=0.776 results.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. The suggestions highlight important methodological considerations that we will address through targeted revisions and additional analyses. We respond to each major comment below.
Point-by-point responses
Referee: [activation patching results] Activation patching experiments (described in the results on directional asymmetry): the observed 87.5% vs. 33.3% asymmetry could arise from differential sensitivity to out-of-distribution activations rather than stable attractor basins. Direct replacement of activations can disrupt attention keys/values and residual norms independently of natural dynamics; additional controls (e.g., patching with activations drawn from other correct trajectories or noise-matched interventions) are needed to isolate the claimed causal role of attractor structure.
Authors: We agree that additional controls would further bolster the interpretation. Our existing random-patch control yields a corruption rate (12.5%) statistically indistinguishable from the no-patch baseline (10.4%), suggesting that non-specific disruptions do not drive the effect. The pronounced directional asymmetry (87.5% vs. 33.3%) is observed consistently across multiple layers and exceeds both controls with p=0.025. To directly address the referee's concern, we will add in the revision: (i) patching using activations sampled from other correct trajectories within the same prompt set, and (ii) noise-matched interventions scaled to match the norm of the target activations. These will be reported alongside the original results. revision: yes
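The "noise-matched interventions scaled to match the norm of the target activations" promised here amount to one rescaling step. A minimal sketch (the function name and details are our assumption; the 1536-dimensional vector matches Qwen2.5-1.5B's hidden size):

```python
import numpy as np

def norm_matched_noise(h_source, rng):
    """Gaussian noise rescaled to the norm of the activation it would
    replace, controlling for raw perturbation magnitude."""
    noise = rng.normal(size=h_source.shape)
    return noise * (np.linalg.norm(h_source) / np.linalg.norm(noise))

rng = np.random.default_rng(0)
h = rng.normal(size=1536)          # stand-in hidden-state vector
patch = norm_matched_noise(h, rng)
```

If a patch of this kind corrupts far less often than a hallucinated activation of the same norm, magnitude alone cannot explain the asymmetry, which is the point of the control.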
Referee: [window patching and bifurcation analysis] Bifurcation and window-patching sections: the claim that correction requires sustained multi-step intervention while corruption needs only a single step is load-bearing for the asymmetric-attractor interpretation, yet the manuscript does not report how layer and window sizes were chosen or whether they were pre-specified versus selected after observing the data. Post-hoc selection risks inflating the apparent asymmetry.
Authors: We acknowledge the importance of transparency in experimental design choices. The layers for peak effects (layer 20 for corruption, layer 24 for recovery) were identified by scanning all 28 layers and selecting those with maximal directional difference; window sizes were set to 1 for single-step and 3-5 for multi-step based on initial observations of trajectory divergence. This process was not fully pre-specified. In the revised manuscript, we will: (1) explicitly describe the layer and window selection procedure, (2) include a supplementary analysis showing the asymmetry holds across a range of window sizes (e.g., 1-7 steps), and (3) report results for all layers to demonstrate that the chosen layers are representative rather than cherry-picked. We believe these additions will confirm the robustness of the asymmetric dynamics. revision: yes
Referee: [prompt encoding probe and unsupervised clustering] Prompt-encoding probe and clustering (layer 15 residual states, eta^2=0.55): while the Pearson r=0.776 and concentration of false-premise prompts in one cluster are suggestive, the analysis must demonstrate that the clusters are not confounded by surface features such as prompt length, lexical overlap, or category (the six categories are mentioned but not balanced in the clustering validation).
Authors: We appreciate this point on potential confounds. The clustering was performed on residual states at layer 15, which capture semantic and structural information beyond surface features. However, to rule out confounds, we will add in the revision: (a) Pearson correlations between cluster labels and prompt length, showing no significant association; (b) analysis of lexical overlap (e.g., Jaccard similarity) across clusters, demonstrating that the saddle-adjacent cluster's enrichment for false-premise prompts persists after controlling for overlap; and (c) a balanced sub-sampling or category-stratified clustering validation to confirm that the concentration (12 of 13 false-premise bifurcators) is not an artifact of category imbalance. These controls will be included to strengthen the claim that the regime-like groups reflect attractor-relevant structure in the prompt encoding. revision: yes
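Two of the controls promised above are simple to state: the eta^2 effect size for the cluster/rate association and a bag-of-words Jaccard measure of lexical overlap. A minimal sketch of both (the formulas are standard; their application to the paper's data is assumed):

```python
import numpy as np

def eta_squared(values, labels):
    """Effect size eta^2 = SS_between / SS_total for a grouping,
    e.g. cluster labels vs. per-prompt hallucination rate."""
    values, labels = np.asarray(values, float), np.asarray(labels)
    grand = values.mean()
    ss_total = ((values - grand) ** 2).sum()
    ss_between = sum(
        (labels == g).sum() * (values[labels == g].mean() - grand) ** 2
        for g in np.unique(labels)
    )
    return float(ss_between / ss_total)

def jaccard(a, b):
    """Lexical overlap between two prompts (bag-of-words Jaccard)."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb)
```

eta^2 is 1 when clusters perfectly separate the rates and 0 when they explain nothing; comparing eta^2 before and after stratifying on Jaccard overlap is one way to run the confound check the rebuttal describes.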
Circularity Check
No significant circularity
full rationale
The paper presents purely empirical results from interventional experiments (same-prompt bifurcation on 61 prompts, activation patching across 28 layers, window patching, prompt-encoding probes with Pearson r and eta^2 clustering). No equations, derivations, fitted parameters renamed as predictions, self-citations, or ansatzes appear in the abstract or described methods. Claims rest on observed data, statistical controls (p=0.025, permutation nulls), and direct comparisons to baselines/random controls rather than reducing to self-referential definitions or load-bearing self-citations. This matches the default expectation for non-circular empirical work.
Axiom & Free-Parameter Ledger
axioms (1)
- standard math: standard assumptions for statistical significance testing (e.g., independence in permutations)
invented entities (1)
- asymmetric attractor dynamics (no independent evidence)
Reference graph
Works this paper leans on
[1] Z. Ji et al., “Survey of hallucination in natural language generation,” ACM Computing Surveys, vol. 55, no. 12, 2023.
[2] L. Huang et al., “A survey on hallucination in large language models,” arXiv:2311.05232, 2023.
[3] A. Azaria and T. Mitchell, “The internal state of an LLM knows when it’s lying,” in Findings of EMNLP, 2023.
[4] K. Li, O. Patel, F. Viegas, H. Pfister, and M. Wattenberg, “Inference-time intervention: Eliciting truthful answers from a language model,” in NeurIPS, 2023.
[5] S. Kadavath et al., “Language models (mostly) know what they know,” arXiv:2207.05221, 2022.
[6] A. Zou et al., “Representation engineering: A top-down approach to AI transparency,” arXiv:2310.01405, 2023.
[7] K. Cherukuri and L. R. Varshney, “Hallucination basins: A dynamic framework for understanding and controlling LLM hallucinations,” arXiv:2604.04743, 2026.
[8] K. Meng, D. Bau, A. Andonian, and Y. Belinkov, “Locating and editing factual associations in GPT,” in NeurIPS, 2022.
[9] S. Heimersheim and N. Nanda, “How to use and interpret activation patching,” arXiv:2404.15255, 2024.
[10] P. Suresh, J. Stanley, S. Joseph, L. Scimeca, and D. Bzdok, “From noise to narrative,” in NeurIPS, 2025.
[11] O. Naparstek, “Projected autoregression: Autoregressive language generation in continuous state space,” arXiv:2601.04854, 2026.
[12] C. Burns, H. Ye, D. Klein, and J. Steinhardt, “Discovering latent knowledge in language models without supervision,” in ICLR, 2023.
[13] N. Nanda and J. Bloom, “TransformerLens,” https://github.com/TransformerLensOrg/TransformerLens, 2022.
[14] Qwen Team, “Qwen2.5 technical report,” arXiv:2412.15115, 2024.
(The paper's Appendix A, Table 3 gives complete bifurcation results for all 61 prompts: N = 20 samples, τ = 0.7; C = correct, H = hallucination, O = other; ⋆ = bifurcating.)