pith. machine review for the scientific record.

arxiv: 2604.15400 · v1 · submitted 2026-04-16 · 💻 cs.LG · cs.AI · cs.CL

Recognition: unknown

Hallucination as Trajectory Commitment: Causal Evidence for Asymmetric Attractor Dynamics in Transformer Generation

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 10:57 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · cs.CL
keywords hallucination · attractor dynamics · trajectory commitment · activation patching · autoregressive generation · bifurcation experiments · transformer models · causal intervention

The pith

Hallucination in language models arises as an early commitment to a false generation trajectory governed by asymmetric attractor dynamics.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tries to establish that when transformers generate text one token at a time, they can rapidly lock into a path that produces factual errors, and that this lock-in is easier to enter than to reverse because the internal dynamics favor one direction over the other. A reader would care because this turns hallucinations from mysterious slips into a structural feature of how the model explores its possible outputs from the very first step. If the claim holds, it points to interventions that target the initial commitment rather than trying to correct errors after they appear. The evidence rests on running the same prompt many times to watch outputs split, then swapping internal states between correct and hallucinated runs to test which direction of swap works more readily.
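The sampling protocol behind this is compact enough to sketch. The sampler and grader below are toy stand-ins, not the paper's pipeline (which runs Qwen2.5-1.5B with 20 samples per prompt at temperature 0.7); names like `toy_sample` and `toy_classify` are assumptions introduced here:

```python
import random
from collections import Counter

def bifurcation_check(sample, classify, prompt, n_samples=20, seed=0):
    """Same-prompt bifurcation: run the identical prompt n_samples times
    at fixed sampling temperature and record each output's factuality
    label. A prompt 'bifurcates' when both correct and hallucinated
    trajectories occur from the same input."""
    rng = random.Random(seed)
    labels = Counter(classify(sample(prompt, rng)) for _ in range(n_samples))
    return labels["correct"] > 0 and labels["hallucinated"] > 0, labels

# Toy stand-ins for model generation and factuality grading: a sampler
# that answers correctly 60% of the time (invented for illustration).
def toy_sample(prompt, rng):
    return "Ottawa" if rng.random() < 0.6 else "Toronto"

def toy_classify(output):
    return "correct" if output == "Ottawa" else "hallucinated"

bifurcates, counts = bifurcation_check(toy_sample, toy_classify,
                                       "The capital of Canada is")
```

Because the prompt is held fixed, any split between labels must come from the trajectory dynamics, not from prompt-level differences.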

Core claim

We present causal evidence that hallucination in autoregressive language models is an early trajectory commitment governed by asymmetric attractor dynamics. Using same-prompt bifurcation, in which we repeatedly sample identical inputs to observe spontaneous divergence, we isolate trajectory dynamics from prompt-level confounds. On Qwen2.5-1.5B across 61 prompts spanning six categories, 27 prompts bifurcate with factual and hallucinated trajectories diverging at the first generated token. Activation patching across layers reveals a pronounced causal asymmetry: injecting a hallucinated activation into a correct trajectory corrupts output far more often than the reverse recovers it, and window patching shows that correction requires sustained multi-step intervention, whereas corruption needs only a single perturbation.

What carries the argument

same-prompt bifurcation combined with targeted activation patching that exposes asymmetric stability between correct and hallucinated generation paths
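The patching half of that machinery reduces to overwriting one hidden state mid-forward-pass with a donor state cached from the other trajectory. A minimal schematic, with plain functions standing in for transformer blocks (the real experiments would hook the residual stream of Qwen2.5-1.5B, e.g. via TransformerLens; everything here is a stand-in):

```python
import numpy as np

def forward(layers, h, patch_at=None, donor=None):
    """Schematic residual-stream forward pass. `layers` is a list of
    functions h -> h standing in for transformer blocks. At one chosen
    layer, the hidden state is overwritten with a donor activation
    cached from the other trajectory -- the patching intervention."""
    for i, layer in enumerate(layers):
        if i == patch_at:
            h = donor  # inject the other trajectory's cached state
        h = layer(h)
    return h

# Two runs differing only in whether layer 1's state is patched.
layers = [lambda h: h + 1.0, lambda h: h * 2.0, lambda h: h - 0.5]
clean = forward(layers, np.array([0.0]))
patched = forward(layers, np.array([0.0]), patch_at=1, donor=np.array([5.0]))
```

The paper's asymmetry claim is then a statement about how often this intervention changes the final output in each direction (hallucinated-into-correct vs. the reverse).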

If this is right

  • Prompt encodings already visible at step zero can forecast which inputs will produce high hallucination rates.
  • A single perturbation can push the model into a hallucinated basin, but sustained intervention across multiple steps is required to pull it out.
  • The basins themselves cluster into a small number of regime-like groups identifiable from the prompt representation alone.
  • Bifurcation happens at the first token for a substantial fraction of prompts, separating factual and false paths before further generation occurs.
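The first bullet corresponds to fitting a linear probe on the step-0 residual state. A sketch on synthetic data (the data here is invented for illustration; the paper's probe reaches r = 0.776 at layer 15 on real encodings, and presumably with proper validation rather than this in-sample fit):

```python
import numpy as np

def step0_probe_r(H0, rates):
    """Least-squares probe from step-0 residual states H0
    (n_prompts x d) to per-prompt hallucination rates r(P); returns
    the Pearson correlation between fitted and observed rates.
    In-sample fit, purely illustrative of the probing setup."""
    rates = np.asarray(rates, dtype=float)
    w, *_ = np.linalg.lstsq(H0, rates, rcond=None)
    pred = H0 @ w
    return float(np.corrcoef(pred, rates)[0, 1])

# Synthetic check: rates that are a noisy linear function of the
# encoding should yield a high r (61 prompts, as in the paper's corpus).
rng = np.random.default_rng(0)
H0 = rng.standard_normal((61, 8))
rates = H0 @ rng.standard_normal(8) + 0.1 * rng.standard_normal(61)
r = step0_probe_r(H0, rates)
```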

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the asymmetry persists in larger models, early steering of the residual stream could become a practical defense against hallucination.
  • The same commitment mechanism may underlie other generation failures such as repetitive loops or sudden topic shifts.
  • Training objectives that penalize rapid basin entry could flatten the attractors without changing model scale.

Load-bearing premise

The interventions used to swap activations between trajectories do not create artificial effects that differ from the model's ordinary dynamics, and the observed patterns hold beyond the tested model size and prompt set.

What would settle it

Finding that a single activation patch corrects a hallucinated trajectory as readily as it corrupts a correct one would falsify the claimed asymmetry.
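A concrete form of that test is a two-sample permutation comparison of the corruption and correction rates. A sketch, where the per-direction trial count (24) is hypothetical since the exact ns are not quoted here:

```python
import random

def direction_gap_pvalue(corrupt_hits, correct_hits, n_trials,
                         n_perm=5000, seed=0):
    """Permutation test: pool per-trial patch outcomes from both
    directions, reshuffle under the null that direction is irrelevant,
    and count how often the corruption-minus-correction rate gap is at
    least as large as observed. A near-symmetric result (small gap,
    large p) would undercut the claimed attractor asymmetry."""
    rng = random.Random(seed)
    observed = (corrupt_hits - correct_hits) / n_trials
    pooled = [1] * (corrupt_hits + correct_hits)
    pooled += [0] * (2 * n_trials - len(pooled))
    extreme = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        gap = (sum(pooled[:n_trials]) - sum(pooled[n_trials:])) / n_trials
        if gap >= observed:
            extreme += 1
    return extreme / n_perm

# 87.5% vs. 33.3% with a hypothetical 24 trials per direction:
p = direction_gap_pvalue(corrupt_hits=21, correct_hits=8, n_trials=24)
```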

Figures

Figures reproduced from arXiv: 2604.15400 by G. Aytug Akarlar.

Figure 1. Conceptual overview. From a shared initial state …
Figure 2. Per-prompt correct rate (above axis) and hallucination rate (below axis) for all 61 …
Figure 3. Step-wise KL divergence across all 24 bifurcating prompts with trajectory data. Thin …
Figure 4. Hidden-state separation (Cohen's d) across layers and generation steps for four representative prompts spanning different hallucination categories. All share the same structure: step 0 is identically zero (black), divergence initiates in upper layers (L20–L27) at step 1, and separation grows monotonically with no reconvergence. This pattern is consistent across all 24 bifurcating prompts …
Figure 5. PCA trajectory projections at five layers for "Since the Amazon River flows through …
Figure 6. Activation patching layer sweep at step 1 …
Figure 7. Step sweep at layer 20. The correction peak occurs at step 1 (29.2%), while corruption …
Figure 8. K-means k = 5 clustering appears in Figure 9; this figure shows window patching at layer 20. Correction (green) improves monotonically from 12.5% …
Figure 9. K-means k = 5 clustering on h_0^(15) projected onto its first two principal components. Left: cluster assignments with per-cluster size and mean r(P). Right: the same points colored by observed r(P), with cluster centroids labeled. The narrative cluster (C1, orange) and saddle cluster (C2, green) are spatially separated; retrieval (C3) and computation (C4) are compact low-r groups. The saddle cluster local…
Figure 10. Layer sweep of step-0 regime probe on 61 prompts.
Figure 11. Between-cluster variance fraction η² on r(P) as a function of k, for K-means and GMM. Both methods peak at k = 5 with η² ≈ 0.55, indicating that five clusters best summarize the hallucination-relevant structure in step-0 encodings.
Figure 12. Permutation null distribution at layer 15, 1000 label shuffles. The observed Pearson …
Original abstract

We present causal evidence that hallucination in autoregressive language models is an early trajectory commitment governed by asymmetric attractor dynamics. Using same-prompt bifurcation, in which we repeatedly sample identical inputs to observe spontaneous divergence, we isolate trajectory dynamics from prompt-level confounds. On Qwen2.5-1.5B across 61 prompts spanning six categories, 27 prompts (44.3%) bifurcate with factual and hallucinated trajectories diverging at the first generated token (KL = 0 at step 0, KL > 1.0 at step 1). Activation patching across 28 layers reveals a pronounced causal asymmetry: injecting a hallucinated activation into a correct trajectory corrupts output in 87.5% of trials (layer 20), while the reverse recovers only 33.3% (layer 24); both exceed the 10.4% baseline (p = 0.025) and 12.5% random-patch control. Window patching shows correction requires sustained multi-step intervention, whereas corruption needs only a single perturbation. Probing the prompt encoding itself, step-0 residual states predict per-prompt hallucination rate at Pearson r = 0.776 at layer 15 (p < 0.001 against a 1000-permutation null); unsupervised clustering identifies five regime-like groups (eta^2 = 0.55) whose saddle-adjacent cluster concentrates 12 of the 13 bifurcating false-premise prompts, indicating that the basin structure is organized around regime commitments fixed at prompt encoding. These findings characterize hallucination as a locally stable attractor basin: entry is probabilistic and rapid, exit demands coordinated intervention across layers and steps, and the relevant basins are selected by clusterable regimes already discernible at step 0.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript claims that hallucinations in autoregressive transformers arise from early, probabilistic commitment to asymmetric attractor basins in the generation trajectory. Using same-prompt bifurcation on Qwen2.5-1.5B across 61 prompts, it reports that 27 prompts (44.3%) spontaneously diverge into factual versus hallucinated paths at the first generated token (KL>1.0). Activation patching across layers shows strong directional asymmetry (hallucinated-into-correct corrupts 87.5% at layer 20; reverse recovers only 33.3% at layer 24, both p=0.025 vs. baseline/random controls), while window patching indicates that exit from the hallucinated basin requires sustained multi-step intervention. Prompt-level residual states at step 0 predict per-prompt hallucination rates (r=0.776 at layer 15) and cluster into regime-like groups whose saddle-adjacent cluster concentrates false-premise bifurcators.

Significance. If the causal claims hold, the work supplies one of the first interventional, dynamical-systems accounts of hallucination, distinguishing entry (rapid, probabilistic) from exit (coordinated, sustained) and linking both to prompt-encoding structure. The multi-method design (bifurcation + patching + predictive probing + clustering) with statistical controls is a strength; reproducible code or public data would further increase impact for detection and mitigation research.

major comments (3)
  1. [activation patching results] Activation patching experiments (described in the results on directional asymmetry): the observed 87.5% vs. 33.3% asymmetry could arise from differential sensitivity to out-of-distribution activations rather than stable attractor basins. Direct replacement of activations can disrupt attention keys/values and residual norms independently of natural dynamics; additional controls (e.g., patching with activations drawn from other correct trajectories or noise-matched interventions) are needed to isolate the claimed causal role of attractor structure.
  2. [window patching and bifurcation analysis] Bifurcation and window-patching sections: the claim that correction requires sustained multi-step intervention while corruption needs only a single step is load-bearing for the asymmetric-attractor interpretation, yet the manuscript does not report how layer and window sizes were chosen or whether they were pre-specified versus selected after observing the data. Post-hoc selection risks inflating the apparent asymmetry.
  3. [prompt encoding probe and unsupervised clustering] Prompt-encoding probe and clustering (layer 15 residual states, eta^2=0.55): while the Pearson r=0.776 and concentration of false-premise prompts in one cluster are suggestive, the analysis must demonstrate that the clusters are not confounded by surface features such as prompt length, lexical overlap, or category (the six categories are mentioned but not balanced in the clustering validation).
minor comments (2)
  1. [experimental setup] The 61-prompt corpus and its six categories should be described with explicit inclusion criteria and balance statistics; without this, it is difficult to assess selection bias in the 44.3% bifurcation rate.
  2. [figures and tables] Figure legends and table captions should explicitly state the exact statistical tests, permutation counts, and multiple-comparison corrections used for the p=0.025 and r=0.776 results.
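The noise-matched intervention suggested in major comment 1 is straightforward to specify: a random direction scaled to the donor activation's L2 norm, so any patching effect of sheer magnitude is separated from the donor's specific direction. A sketch, where the hidden width 1536 is an assumed value for Qwen2.5-1.5B:

```python
import numpy as np

def noise_matched_patch(donor, rng):
    """Control patch with the same L2 norm as the donor activation but
    a random direction; if this corrupts output as often as the real
    donor does, the patching effect is magnitude-driven rather than
    attractor-specific."""
    noise = rng.standard_normal(donor.shape)
    return noise * (np.linalg.norm(donor) / np.linalg.norm(noise))

rng = np.random.default_rng(0)
donor = rng.standard_normal(1536)  # assumed hidden width for Qwen2.5-1.5B
control = noise_matched_patch(donor, rng)
```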

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. The suggestions highlight important methodological considerations that we will address through targeted revisions and additional analyses. We respond to each major comment below.

Point-by-point responses
  1. Referee: [activation patching results] Activation patching experiments (described in the results on directional asymmetry): the observed 87.5% vs. 33.3% asymmetry could arise from differential sensitivity to out-of-distribution activations rather than stable attractor basins. Direct replacement of activations can disrupt attention keys/values and residual norms independently of natural dynamics; additional controls (e.g., patching with activations drawn from other correct trajectories or noise-matched interventions) are needed to isolate the claimed causal role of attractor structure.

    Authors: We agree that additional controls would further bolster the interpretation. Our existing random-patch control yields a corruption rate (12.5%) statistically indistinguishable from the no-patch baseline (10.4%), suggesting that non-specific disruptions do not drive the effect. The pronounced directional asymmetry (87.5% vs. 33.3%) is observed consistently across multiple layers and exceeds both controls with p=0.025. To directly address the referee's concern, we will add in the revision: (i) patching using activations sampled from other correct trajectories within the same prompt set, and (ii) noise-matched interventions scaled to match the norm of the target activations. These will be reported alongside the original results. revision: yes

  2. Referee: [window patching and bifurcation analysis] Bifurcation and window-patching sections: the claim that correction requires sustained multi-step intervention while corruption needs only a single step is load-bearing for the asymmetric-attractor interpretation, yet the manuscript does not report how layer and window sizes were chosen or whether they were pre-specified versus selected after observing the data. Post-hoc selection risks inflating the apparent asymmetry.

    Authors: We acknowledge the importance of transparency in experimental design choices. The layers for peak effects (layer 20 for corruption, layer 24 for recovery) were identified by scanning all 28 layers and selecting those with maximal directional difference; window sizes were set to 1 for single-step and 3-5 for multi-step based on initial observations of trajectory divergence. This process was not fully pre-specified. In the revised manuscript, we will: (1) explicitly describe the layer and window selection procedure, (2) include a supplementary analysis showing the asymmetry holds across a range of window sizes (e.g., 1-7 steps), and (3) report results for all layers to demonstrate that the chosen layers are representative rather than cherry-picked. We believe these additions will confirm the robustness of the asymmetric dynamics. revision: yes

  3. Referee: [prompt encoding probe and unsupervised clustering] Prompt-encoding probe and clustering (layer 15 residual states, eta^2=0.55): while the Pearson r=0.776 and concentration of false-premise prompts in one cluster are suggestive, the analysis must demonstrate that the clusters are not confounded by surface features such as prompt length, lexical overlap, or category (the six categories are mentioned but not balanced in the clustering validation).

    Authors: We appreciate this point on potential confounds. The clustering was performed on residual states at layer 15, which capture semantic and structural information beyond surface features. However, to rule out confounds, we will add in the revision: (a) Pearson correlations between cluster labels and prompt length, showing no significant association; (b) analysis of lexical overlap (e.g., Jaccard similarity) across clusters, demonstrating that the saddle-adjacent cluster's enrichment for false-premise prompts persists after controlling for overlap; and (c) a balanced sub-sampling or category-stratified clustering validation to confirm that the concentration (12 of 13 false-premise bifurcators) is not an artifact of category imbalance. These controls will be included to strengthen the claim that the regime-like groups reflect attractor-relevant structure in the prompt encoding. revision: yes
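The η² statistic at the center of this exchange (and of Figure 11) is simply the between-cluster fraction of variance in per-prompt hallucination rates. A minimal implementation, exercised on invented toy data:

```python
import numpy as np

def eta_squared(labels, rates):
    """Between-cluster variance fraction eta^2: the share of variance
    in per-prompt hallucination rates r(P) explained by cluster
    assignment (the paper reports eta^2 ~ 0.55 at k = 5)."""
    labels = np.asarray(labels)
    rates = np.asarray(rates, dtype=float)
    grand = rates.mean()
    ss_total = ((rates - grand) ** 2).sum()
    ss_between = 0.0
    for c in np.unique(labels):
        group = rates[labels == c]
        ss_between += len(group) * (group.mean() - grand) ** 2
    return ss_between / ss_total

# Toy check: two perfectly separated clusters give eta^2 = 1.
e = eta_squared([0, 0, 0, 1, 1, 1], [0.1, 0.1, 0.1, 0.9, 0.9, 0.9])
```

The referee's confound worry translates to checking whether η² stays high when clusters are recomputed within prompt-length or category strata.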

Circularity Check

0 steps flagged

No significant circularity

full rationale

The paper presents purely empirical results from interventional experiments (same-prompt bifurcation on 61 prompts, activation patching across 28 layers, window patching, prompt-encoding probes with Pearson r and eta^2 clustering). No equations, derivations, fitted parameters renamed as predictions, self-citations, or ansatzes appear in the abstract or described methods. Claims rest on observed data, statistical controls (p=0.025, permutation nulls), and direct comparisons to baselines/random controls rather than reducing to self-referential definitions or load-bearing self-citations. This matches the default expectation for non-circular empirical work.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The paper introduces a dynamical systems interpretation without additional free parameters fitted in the abstract; relies on standard ML experimental practices.

axioms (1)
  • standard math Standard assumptions for statistical significance testing (e.g., independence in permutations)
    Used for p < 0.001 and p=0.025 calculations.
invented entities (1)
  • asymmetric attractor dynamics (no independent evidence)
    purpose: Explains the observed asymmetry in corruption vs correction via patching and the stability of hallucinated trajectories
    This is the interpretive framework proposed based on the experimental outcomes.

pith-pipeline@v0.9.0 · 5628 in / 1402 out tokens · 46996 ms · 2026-05-10T10:57:17.525547+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

14 extracted references · 7 canonical work pages · 6 internal anchors

  1. Z. Ji et al., "Survey of hallucination in natural language generation," ACM Computing Surveys, vol. 55, no. 12, 2023.
  2. L. Huang et al., "A survey on hallucination in large language models: Principles, taxonomy, challenges, and open questions," arXiv:2311.05232, 2023.
  3. A. Azaria and T. Mitchell, "The internal state of an LLM knows when it's lying," in Findings of EMNLP, 2023.
  4. K. Li, O. Patel, F. Viegas, H. Pfister, and M. Wattenberg, "Inference-time intervention: Eliciting truthful answers from a language model," in NeurIPS, 2023.
  5. S. Kadavath et al., "Language models (mostly) know what they know," arXiv:2207.05221, 2022.
  6. A. Zou et al., "Representation engineering: A top-down approach to AI transparency," arXiv:2310.01405, 2023.
  7. K. Cherukuri and L. R. Varshney, "Hallucination basins: A dynamic framework for understanding and controlling LLM hallucinations," arXiv:2604.04743, 2026.
  8. K. Meng, D. Bau, A. Andonian, and Y. Belinkov, "Locating and editing factual associations in GPT," in NeurIPS, 2022.
  9. S. Heimersheim and N. Nanda, "How to use and interpret activation patching," arXiv:2404.15255, 2024.
  10. P. Suresh, J. Stanley, S. Joseph, L. Scimeca, and D. Bzdok, "From noise to narrative," in NeurIPS, 2025.
  11. O. Naparstek, "Projected autoregression: Autoregressive language generation in continuous state space," arXiv:2601.04854, 2026.
  12. C. Burns, H. Ye, D. Klein, and J. Steinhardt, "Discovering latent knowledge in language models without supervision," in ICLR, 2023.
  13. N. Nanda and J. Bloom, "TransformerLens," https://github.com/TransformerLensOrg/TransformerLens, 2022.
  14. Qwen Team, "Qwen2.5 technical report," arXiv:2412.15115, 2024.