Hallucination as Trajectory Commitment: Causal Evidence for Asymmetric Attractor Dynamics in Transformer Generation
Pith reviewed 2026-05-10 10:57 UTC · model grok-4.3
The pith
Hallucination in language models arises as an early commitment to a false generation trajectory governed by asymmetric attractor dynamics.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We present causal evidence that hallucination in autoregressive language models is an early trajectory commitment governed by asymmetric attractor dynamics. Using same-prompt bifurcation, in which we repeatedly sample identical inputs to observe spontaneous divergence, we isolate trajectory dynamics from prompt-level confounds. On Qwen2.5-1.5B across 61 prompts spanning six categories, 27 prompts bifurcate with factual and hallucinated trajectories diverging at the first generated token. Activation patching across layers reveals a pronounced causal asymmetry: injecting a hallucinated activation into a correct trajectory corrupts output far more often than the reverse recovers it, and window patching shows that correction requires sustained multi-step intervention, whereas corruption needs only a single perturbation.
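The bifurcation criterion the paper uses (KL = 0 at step 0, KL > 1.0 at step 1, per the full abstract below) reduces to a simple check over next-token distributions. A minimal illustrative sketch, not the paper's code; the threshold and tolerance values here are assumptions:

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) between two next-token probability distributions."""
    p = np.asarray(p, dtype=float) + eps
    q = np.asarray(q, dtype=float) + eps
    p, q = p / p.sum(), q / q.sum()
    return float(np.sum(p * np.log(p / q)))

def bifurcates(factual_dists, halluc_dists, threshold=1.0, tol=1e-6):
    """Flag a prompt as bifurcating when the two trajectory groups
    share their step-0 distribution (KL ~ 0) but diverge at step 1."""
    return (kl_divergence(factual_dists[0], halluc_dists[0]) < tol
            and kl_divergence(factual_dists[1], halluc_dists[1]) > threshold)

# Toy prompt: identical step-0 distributions, divergent step-1 distributions.
factual = [np.array([0.5, 0.5]), np.array([0.95, 0.05])]
halluc  = [np.array([0.5, 0.5]), np.array([0.05, 0.95])]
```

On this toy prompt `bifurcates(factual, halluc)` holds: the groups agree exactly at step 0 and the step-1 KL is well above 1.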
What carries the argument
Same-prompt bifurcation combined with targeted activation patching, which together expose asymmetric stability between correct and hallucinated generation paths.
If this is right
- Prompt encodings already visible at step zero can forecast which inputs will produce high hallucination rates.
- A single perturbation can push the model into a hallucinated basin, but sustained intervention across multiple steps is required to pull it out.
- The basins themselves cluster into a small number of regime-like groups identifiable from the prompt representation alone.
- Bifurcation happens at the first token for a substantial fraction of prompts, separating factual and false paths before further generation occurs.
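The entry/exit asymmetry these bullets describe has a familiar dynamical-systems analogue: a tilted double well, where falling into the deeper basin takes one kick but climbing out takes sustained forcing. The sketch below is purely illustrative; the potential, step size, and kick magnitudes are our choices, not anything from the paper:

```python
# Illustrative 1-D analogue of asymmetric basins: a tilted double well
# V(x) = x^4/4 - x^2/2 + 0.3x, with a shallow "factual" minimum near
# x ~ +0.79 and a deeper "hallucination" minimum near x ~ -1.13.
FACTUAL, HALLUC = 0.79, -1.13

def step(x, push=0.0, dt=0.1):
    grad = x**3 - x + 0.3          # V'(x)
    return x + dt * (-grad + push)

def settle(x, n=400):
    """Relax to the nearest minimum with no intervention."""
    for _ in range(n):
        x = step(x)
    return x

def run(x, kick, steps):
    """Apply a constant push for `steps` steps, then relax."""
    for _ in range(steps):
        x = step(x, push=kick)
    return settle(x)

corrupted = run(FACTUAL, kick=-6.0, steps=1)   # one-step kick falls into the deep well
failed    = run(HALLUC,  kick=+6.0, steps=1)   # equal one-step kick fails to escape
recovered = run(HALLUC,  kick=+1.0, steps=60)  # sustained push does escape
```

In this toy system a single large kick from the factual minimum ends in the hallucination basin, an equal kick in the opposite direction falls back, and only a sustained push carries the state over the barrier, mirroring the single-perturbation-corrupts / multi-step-corrects pattern claimed for the model.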
Where Pith is reading between the lines
- If the asymmetry persists in larger models, early steering of the residual stream could become a practical defense against hallucination.
- The same commitment mechanism may underlie other generation failures such as repetitive loops or sudden topic shifts.
- Training objectives that penalize rapid basin entry could flatten the attractors without changing model scale.
Load-bearing premise
The interventions used to swap activations between trajectories do not create artificial effects that differ from the model's ordinary dynamics, and the observed patterns hold beyond the tested model size and prompt set.
What would settle it
Finding that a single activation patch corrects a hallucinated trajectory as readily as it corrupts a correct one would falsify the claimed asymmetry.
Original abstract
We present causal evidence that hallucination in autoregressive language models is an early trajectory commitment governed by asymmetric attractor dynamics. Using same-prompt bifurcation, in which we repeatedly sample identical inputs to observe spontaneous divergence, we isolate trajectory dynamics from prompt-level confounds. On Qwen2.5-1.5B across 61 prompts spanning six categories, 27 prompts (44.3%) bifurcate with factual and hallucinated trajectories diverging at the first generated token (KL = 0 at step 0, KL > 1.0 at step 1). Activation patching across 28 layers reveals a pronounced causal asymmetry: injecting a hallucinated activation into a correct trajectory corrupts output in 87.5% of trials (layer 20), while the reverse recovers only 33.3% (layer 24); both exceed the 10.4% baseline (p = 0.025) and 12.5% random-patch control. Window patching shows correction requires sustained multi-step intervention, whereas corruption needs only a single perturbation. Probing the prompt encoding itself, step-0 residual states predict per-prompt hallucination rate at Pearson r = 0.776 at layer 15 (p < 0.001 against a 1000-permutation null); unsupervised clustering identifies five regime-like groups (eta^2 = 0.55) whose saddle-adjacent cluster concentrates 12 of the 13 bifurcating false-premise prompts, indicating that the basin structure is organized around regime commitments fixed at prompt encoding. These findings characterize hallucination as a locally stable attractor basin: entry is probabilistic and rapid, exit demands coordinated intervention across layers and steps, and the relevant basins are selected by clusterable regimes already discernible at step 0.
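The step-0 probing result (Pearson r = 0.776 against a 1000-permutation null) uses a standard permutation test for a correlation. A minimal sketch on synthetic data; the probe scores and hallucination rates below are fabricated stand-ins, not the paper's measurements:

```python
import numpy as np

def pearson_r(x, y):
    x, y = np.asarray(x, float), np.asarray(y, float)
    xc, yc = x - x.mean(), y - y.mean()
    return float((xc @ yc) / np.sqrt((xc @ xc) * (yc @ yc)))

def permutation_test(x, y, n_perm=1000, seed=0):
    """Observed Pearson r and a two-sided p-value against a
    shuffled-label null of n_perm permutations."""
    rng = np.random.default_rng(seed)
    r_obs = pearson_r(x, y)
    null = np.array([pearson_r(x, rng.permutation(y)) for _ in range(n_perm)])
    return r_obs, float(np.mean(np.abs(null) >= abs(r_obs)))

# Synthetic stand-ins: 61 step-0 probe scores and correlated hallucination rates.
rng = np.random.default_rng(1)
probe_score = rng.normal(size=61)
halluc_rate = 0.8 * probe_score + rng.normal(scale=0.5, size=61)
r, p = permutation_test(probe_score, halluc_rate)
```

With 61 prompts and a strong underlying correlation, the observed r sits far outside the shuffled null, so the permutation p-value is effectively zero, the same logic as the paper's p < 0.001 claim.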
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript claims that hallucinations in autoregressive transformers arise from early, probabilistic commitment to asymmetric attractor basins in the generation trajectory. Using same-prompt bifurcation on Qwen2.5-1.5B across 61 prompts, it reports that 27 prompts (44.3%) spontaneously diverge into factual versus hallucinated paths at the first generated token (KL>1.0). Activation patching across layers shows strong directional asymmetry (hallucinated-into-correct corrupts 87.5% at layer 20; the reverse recovers only 33.3% at layer 24, both exceeding the 10.4% baseline at p=0.025 and the 12.5% random-patch control), while window patching indicates that exit from the hallucinated basin requires sustained multi-step intervention. Prompt-level residual states at step 0 predict per-prompt hallucination rates (r=0.776 at layer 15) and cluster into regime-like groups whose saddle-adjacent cluster concentrates false-premise bifurcators.
Significance. If the causal claims hold, the work supplies one of the first interventional, dynamical-systems accounts of hallucination, distinguishing entry (rapid, probabilistic) from exit (coordinated, sustained) and linking both to prompt-encoding structure. The multi-method design (bifurcation + patching + predictive probing + clustering) with statistical controls is a strength; reproducible code or public data would further increase impact for detection and mitigation research.
major comments (3)
- [activation patching results] Activation patching experiments (described in the results on directional asymmetry): the observed 87.5% vs. 33.3% asymmetry could arise from differential sensitivity to out-of-distribution activations rather than stable attractor basins. Direct replacement of activations can disrupt attention keys/values and residual norms independently of natural dynamics; additional controls (e.g., patching with activations drawn from other correct trajectories or noise-matched interventions) are needed to isolate the claimed causal role of attractor structure.
- [window patching and bifurcation analysis] Bifurcation and window-patching sections: the claim that correction requires sustained multi-step intervention while corruption needs only a single step is load-bearing for the asymmetric-attractor interpretation, yet the manuscript does not report how layer and window sizes were chosen or whether they were pre-specified versus selected after observing the data. Post-hoc selection risks inflating the apparent asymmetry.
- [prompt encoding probe and unsupervised clustering] Prompt-encoding probe and clustering (layer 15 residual states, eta^2=0.55): while the Pearson r=0.776 and concentration of false-premise prompts in one cluster are suggestive, the analysis must demonstrate that the clusters are not confounded by surface features such as prompt length, lexical overlap, or category (the six categories are mentioned but not balanced in the clustering validation).
minor comments (2)
- [experimental setup] The 61-prompt corpus and its six categories should be described with explicit inclusion criteria and balance statistics; without this, it is difficult to assess selection bias in the 44.3% bifurcation rate.
- [figures and tables] Figure legends and table captions should explicitly state the exact statistical tests, permutation counts, and multiple-comparison corrections used for the p=0.025 and r=0.776 results.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. The suggestions highlight important methodological considerations that we will address through targeted revisions and additional analyses. We respond to each major comment below.
Point-by-point responses
Referee: [activation patching results] Activation patching experiments (described in the results on directional asymmetry): the observed 87.5% vs. 33.3% asymmetry could arise from differential sensitivity to out-of-distribution activations rather than stable attractor basins. Direct replacement of activations can disrupt attention keys/values and residual norms independently of natural dynamics; additional controls (e.g., patching with activations drawn from other correct trajectories or noise-matched interventions) are needed to isolate the claimed causal role of attractor structure.
Authors: We agree that additional controls would further bolster the interpretation. Our existing random-patch control yields a corruption rate (12.5%) statistically indistinguishable from the no-patch baseline (10.4%), suggesting that non-specific disruptions do not drive the effect. The pronounced directional asymmetry (87.5% vs. 33.3%) is observed consistently across multiple layers and exceeds both controls with p=0.025. To directly address the referee's concern, we will add in the revision: (i) patching using activations sampled from other correct trajectories within the same prompt set, and (ii) noise-matched interventions scaled to match the norm of the target activations. These will be reported alongside the original results. revision: yes
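The "noise-matched interventions scaled to match the norm of the target activations" promised here amount to one rescaling step. A minimal sketch (the function name and details are our assumption; the 1536-dimensional vector matches Qwen2.5-1.5B's hidden size):

```python
import numpy as np

def norm_matched_noise(h_source, rng):
    """Gaussian noise rescaled to the norm of the activation it would
    replace, controlling for raw perturbation magnitude."""
    noise = rng.normal(size=h_source.shape)
    return noise * (np.linalg.norm(h_source) / np.linalg.norm(noise))

rng = np.random.default_rng(0)
h = rng.normal(size=1536)          # stand-in hidden-state vector
patch = norm_matched_noise(h, rng)
```

If a patch of this kind corrupts far less often than a hallucinated activation of the same norm, magnitude alone cannot explain the asymmetry, which is the point of the control.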
Referee: [window patching and bifurcation analysis] Bifurcation and window-patching sections: the claim that correction requires sustained multi-step intervention while corruption needs only a single step is load-bearing for the asymmetric-attractor interpretation, yet the manuscript does not report how layer and window sizes were chosen or whether they were pre-specified versus selected after observing the data. Post-hoc selection risks inflating the apparent asymmetry.
Authors: We acknowledge the importance of transparency in experimental design choices. The layers for peak effects (layer 20 for corruption, layer 24 for recovery) were identified by scanning all 28 layers and selecting those with maximal directional difference; window sizes were set to 1 for single-step and 3-5 for multi-step based on initial observations of trajectory divergence. This process was not fully pre-specified. In the revised manuscript, we will: (1) explicitly describe the layer and window selection procedure, (2) include a supplementary analysis showing the asymmetry holds across a range of window sizes (e.g., 1-7 steps), and (3) report results for all layers to demonstrate that the chosen layers are representative rather than cherry-picked. We believe these additions will confirm the robustness of the asymmetric dynamics. revision: yes
Referee: [prompt encoding probe and unsupervised clustering] Prompt-encoding probe and clustering (layer 15 residual states, eta^2=0.55): while the Pearson r=0.776 and concentration of false-premise prompts in one cluster are suggestive, the analysis must demonstrate that the clusters are not confounded by surface features such as prompt length, lexical overlap, or category (the six categories are mentioned but not balanced in the clustering validation).
Authors: We appreciate this point on potential confounds. The clustering was performed on residual states at layer 15, which capture semantic and structural information beyond surface features. However, to rule out confounds, we will add in the revision: (a) Pearson correlations between cluster labels and prompt length, showing no significant association; (b) analysis of lexical overlap (e.g., Jaccard similarity) across clusters, demonstrating that the saddle-adjacent cluster's enrichment for false-premise prompts persists after controlling for overlap; and (c) a balanced sub-sampling or category-stratified clustering validation to confirm that the concentration (12 of 13 false-premise bifurcators) is not an artifact of category imbalance. These controls will be included to strengthen the claim that the regime-like groups reflect attractor-relevant structure in the prompt encoding. revision: yes
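Two of the controls promised above are simple to state: the eta^2 effect size for the cluster/rate association and a bag-of-words Jaccard measure of lexical overlap. A minimal sketch of both (the formulas are standard; their application to the paper's data is assumed):

```python
import numpy as np

def eta_squared(values, labels):
    """Effect size eta^2 = SS_between / SS_total for a grouping,
    e.g. cluster labels vs. per-prompt hallucination rate."""
    values, labels = np.asarray(values, float), np.asarray(labels)
    grand = values.mean()
    ss_total = ((values - grand) ** 2).sum()
    ss_between = sum(
        (labels == g).sum() * (values[labels == g].mean() - grand) ** 2
        for g in np.unique(labels)
    )
    return float(ss_between / ss_total)

def jaccard(a, b):
    """Lexical overlap between two prompts (bag-of-words Jaccard)."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb)
```

eta^2 is 1 when clusters perfectly separate the rates and 0 when they explain nothing; comparing eta^2 before and after stratifying on Jaccard overlap is one way to run the confound check the rebuttal describes.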
Circularity Check
No significant circularity
full rationale
The paper presents purely empirical results from interventional experiments (same-prompt bifurcation on 61 prompts, activation patching across 28 layers, window patching, prompt-encoding probes with Pearson r and eta^2 clustering). No equations, derivations, fitted parameters renamed as predictions, self-citations, or ansatzes appear in the abstract or described methods. Claims rest on observed data, statistical controls (p=0.025, permutation nulls), and direct comparisons to baselines/random controls rather than reducing to self-referential definitions or load-bearing self-citations. This matches the default expectation for non-circular empirical work.
Axiom & Free-Parameter Ledger
axioms (1)
- standard math: standard assumptions for statistical significance testing (e.g., independence in permutations)
invented entities (1)
- asymmetric attractor dynamics (no independent evidence)
Reference graph
Works this paper leans on
[1] Z. Ji et al., “Survey of hallucination in natural language generation,” ACM Computing Surveys, vol. 55, no. 12, 2023.
[2] L. Huang et al., “A survey on hallucination in large language models,” arXiv:2311.05232, 2023.
[3] A. Azaria and T. Mitchell, “The internal state of an LLM knows when it’s lying,” in Findings of EMNLP, 2023.
[4] K. Li, O. Patel, F. Viegas, H. Pfister, and M. Wattenberg, “Inference-time intervention: Eliciting truthful answers from a language model,” in NeurIPS, 2023.
[5] S. Kadavath et al., “Language models (mostly) know what they know,” arXiv:2207.05221, 2022.
[6] A. Zou et al., “Representation engineering: A top-down approach to AI transparency,” arXiv:2310.01405, 2023.
[7] K. Cherukuri and L. R. Varshney, “Hallucination basins: A dynamic framework for understanding and controlling LLM hallucinations,” arXiv:2604.04743, 2026.
[8] K. Meng, D. Bau, A. Andonian, and Y. Belinkov, “Locating and editing factual associations in GPT,” in NeurIPS, 2022.
[9] S. Heimersheim and N. Nanda, “How to use and interpret activation patching,” arXiv:2404.15255, 2024.
[10] P. Suresh, J. Stanley, S. Joseph, L. Scimeca, and D. Bzdok, “From noise to narrative,” in NeurIPS, 2025.
[11] O. Naparstek, “Projected autoregression: Autoregressive language generation in continuous state space,” arXiv:2601.04854, 2026.
[12] C. Burns, H. Ye, D. Klein, and J. Steinhardt, “Discovering latent knowledge in language models without supervision,” in ICLR, 2023.
[13] N. Nanda and J. Bloom, “TransformerLens,” https://github.com/TransformerLensOrg/TransformerLens, 2022.
[14] Qwen Team, “Qwen2.5 technical report,” arXiv:2412.15115, 2024.
(The paper's Appendix A, Table 3 gives complete bifurcation results for all 61 prompts: N = 20 samples, τ = 0.7; C = correct, H = hallucination, O = other; ⋆ = bifurcating.)