pith. machine review for the scientific record.

arxiv: 2605.05686 · v2 · submitted 2026-05-07 · 💻 cs.AI

Recognition: 2 theorem links · Lean Theorem

Attractor Geometry of Transformer Memory: From Conflict Arbitration to Confident Hallucination

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 07:10 UTC · model grok-4.3

classification 💻 cs.AI
keywords: attractor basins · hallucinations · transformer memory · geometric margin · parametric memory · working memory · conflict · scaling laws

The pith

Transformer hidden states form attractor basins around memorized facts, so distance to the nearest basin cleanly separates correct recall from both conflict and hallucination even when output entropy cannot.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that two distinct failure modes in language models—conflict between stored facts and new context, and outright hallucination of unlearned facts—share a single geometric structure in the hidden-state space. Learned facts create attractor basins that guide generation; conflict occurs when working memory pulls the state between basins without raising uncertainty at the output, while hallucination occurs when no basin exists and the state drifts freely. In both cases the frozen output head produces confident tokens because it is blind to this geometry. The key measurable is geometric margin, the hidden state's distance to the closest memorized basin, which separates correct outputs from errors with no false refusals. The same separation appears in natural-language queries on an unmodified pretrained model, and the rate of confident hallucinations follows an exponential scaling law with average margin.

Core claim

In the hidden-state space of autoregressive generation, learned facts form attractor basins. Conflict is basin competition: working memory disrupts convergence to the correct basin without raising output entropy. Hallucination is basin absence: the hidden state drifts freely when no memorized basin exists. The frozen LM head, designed for next-token prediction, cannot distinguish these cases and fires confidently either way. Geometric margin—the hidden state's distance to the nearest memorized basin—reads this geometry directly and separates correct recall from hallucination far more cleanly than output entropy, with zero false refusals. The fraction of confident hallucinations follows the scaling law C = exp(-c/Δ̄), growing with scale even as overall error rates fall.

What carries the argument

attractor basins formed by learned facts in hidden-state space, where geometric margin (distance to nearest basin) directly reads whether a fact is accessible
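
To make the measurable concrete, here is a minimal sketch of the margin computation, assuming (per the simulated rebuttal below) that basin centers are hidden-state centroids averaged over a few paraphrased prompts per fact and that distances are Euclidean; the names and array shapes are illustrative, not the authors' code.

```python
import numpy as np

def basin_centers(hidden_states_per_fact):
    # One center per memorized fact: the mean final-layer hidden state
    # over paraphrased prompts (the rebuttal describes five paraphrases).
    # hidden_states_per_fact: dict fact_id -> array (n_paraphrases, d)
    return {fid: h.mean(axis=0) for fid, h in hidden_states_per_fact.items()}

def geometric_margin(h, centers):
    # Euclidean distance from the current hidden state h (shape (d,))
    # to the nearest memorized basin center. Low margin: the state sits
    # inside a basin (likely correct recall). High margin: no nearby
    # basin, i.e., the free-drift signature of hallucination.
    stacked = np.stack(list(centers.values()))
    return float(np.linalg.norm(stacked - h, axis=1).min())
```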

If this is right

  • Output entropy cannot reliably flag hallucinations because both basin competition and basin absence produce low-entropy outputs.
  • Geometric margin provides a direct epistemic signal that the frozen output head erases, and this erasure becomes more costly as models scale.
  • The fraction of confident hallucinations tracks average margin exponentially via C = exp(-c/Δ̄), growing with scale even while overall error rates decline.
  • The geometry is structural in pretrained transformers rather than an artifact of adapter fine-tuning.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If hidden-state geometry is the primary carrier of epistemic state, generation-time monitoring of margin could support refusal policies that avoid rejecting correct answers.
  • The scaling law implies that larger models will require denser basin coverage to keep confident hallucination rates from rising.
  • Similar basin structures may exist in non-language modalities, offering a unified account of confident errors across multimodal systems.

Load-bearing premise

The controlled synthetic task, with entity identifiers mapped to codes and parametric memory installed via LoRA adapters, faithfully isolates and models the attractor dynamics present in full-scale pretrained autoregressive language models answering natural-language queries.

What would settle it

Measure geometric margin versus output entropy on a large set of natural-language factual queries with known ground truth; if the margin fails to separate correct recall from hallucination with near-zero false refusals, or if the separation collapses on unmodified pretrained models, the geometric account does not hold.
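
This test reduces to a few lines. A hedged sketch, assuming per-query margins and entropies are already computed and treating each signal as a binary hallucination scorer; roc_auc_score is from scikit-learn, and the false-refusal check uses the loosest threshold that still flags every hallucination. None of this is the authors' evaluation code.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def settle_it(margins, entropies, is_hallucination):
    # is_hallucination: 1 = hallucinated answer, 0 = correct recall.
    # Higher margin / entropy should indicate hallucination.
    margins = np.asarray(margins, dtype=float)
    entropies = np.asarray(entropies, dtype=float)
    y = np.asarray(is_hallucination, dtype=int)

    auroc_margin = roc_auc_score(y, margins)
    auroc_entropy = roc_auc_score(y, entropies)

    # False-refusal rate at the loosest threshold that catches every
    # hallucination: the fraction of correct answers we would refuse.
    threshold = margins[y == 1].min()
    false_refusals = float(np.mean(margins[y == 0] >= threshold))
    return auroc_margin, auroc_entropy, false_refusals
```

If the geometric account holds, auroc_margin should approach 1.0 with false_refusals near 0 on the unmodified pretrained model; if the separation collapses there, the account fails.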

Figures

Figures reproduced from arXiv: 2605.05686 by Ila Fiete, Qiyao Liang, Risto Miikkulainen.

Figure 1. Two-memory system in transformer language models. Each component of the transformer plays a distinct memory role; this paper makes that dissociation precise through targeted LoRA interventions. (a) Architecture: recall decomposes into an attention addressing mechanism (QK; blue) that routes evidence through the residual stream, and a shared content pipeline (VO+MLP; orange) that writes content and updates… view at source ↗
Figure 2. Schematic representation-space geometry at the final generation step. Memory arbitration—whether output comes from stored weights or input context—can be understood as trajectory convergence to competing attractors in the model’s representation space. (a) WM conditioning induces a transient pseudo-attractor: a pull toward a context-consistent state that persists only while those context tokens are active,… view at source ↗
Figure 3. Jacobian symmetry correlation reveals distinct component roles. The distinct functional roles of QK (routing), VO (content readout), and MLP (basin shaping) are measurable directly from the pretrained model’s Jacobians, independent of any fine-tuning. Pretrained Qwen2.5-3B; exact 2048 × 2048 Jacobians at seven layers, averaged over five prompts. (a) Symmetry correlation φ by component: VO is strongly symme… view at source ↗
Figure 4. Memory circuit dissociation under brittle and robust PM. Adapting different components produces qualitatively different perturbation signatures, confirming the attractor-geometry predictions; robust PM training produces complete context insensitivity as a byproduct of format-invariant memorization. (a) PM and WM recall by adapter type for both training regimes. Brittle PM: all adapters achieve 100% PM on … view at source ↗
Figure 5. Hallucination and LM head output bias. For entities never trained on, the LM head can produce near-zero-entropy outputs—making hallucinations indistinguishable from genuine recall in output space. (a) Correct-token rank (log scale): QK-only near rank 5 with moderate entropy; VO-only produces the lowest entropy (H = 0.17) despite the correct token ranking beyond 1,000 (write-back lock-in). (b) Digit entropy… view at source ↗
Figure 6. Geometric signals outperform entropy, and the gap widens with scale. The distance from the current hidden state to the nearest memorized basin (margin) separates correct recall from hallucination far more cleanly than output entropy—and this advantage grows as models scale up. (a) Margin vs. entropy for 450 queries: correct outputs cluster at low margin, hallucinations at high margin, across all five evalu… view at source ↗
Figure 7. Learning efficiency scaling curves. Gradient steps to first reach training loss < 0.05 (log–log scale). All adapters achieve near-zero final loss for all N; the y-axis captures how efficiently each adapter memorizes. (a) Module ablation at r = 8. MLP-only and Full require ∼3–4× fewer steps than QK-only, confirming MLP layers as the primary substrate for gradient-efficient association storage. Steps scale a… view at source ↗
Figure 8. Format sensitivity heatmap under brittle PM. PM recall accuracy (green = 100%, red = 0%) across five prompt formats and four adapter types. The sharp binary pattern confirms catastrophic format gating: any deviation from the exact training template collapses accuracy to 0%. The WM context prefix row reveals an exception for QK-only (100%), consistent with routing being more robust to prompt structure chang… view at source ↗
Figure 9. WM–PM agreement analysis. (a) Aggregate WM recall accuracy comparing WM-only (solid) vs. WM+PM-agree (hatched) conditions. MLP-only shows the largest rescue (+24.3%), while VO-only shows the largest degradation (−38.7%). (b) Per-digit accuracy for QK-only (reference), VO-only, and MLP-only under WM-only (solid) vs. WM+PM-agree (dashed). Shaded regions highlight the difference. MLP-only rescue is concentrat… view at source ↗
Figure 10. Signed distance heatmaps under WM–PM conflict. Δ = ‖h_conflict − h_PM‖ − ‖h_conflict − h_WM‖; blue = PM-captured, red = WM-like. (a) Brittle PM MLP: both panels share the same colour scale. At digit 1, Δ ≈ −15 (weakly PM-captured) for both adapters—the first-digit snapshot is nearly identical regardless of adapter strength and cannot distinguish conflict outcomes. Trajectories diverge over subsequent digits: … view at source ↗
Figure 11. Perturbation stability of robust vs. brittle PM fixed points. Gaussian noise (σ = α‖e‖) added to input embeddings during autoregressive digit generation. 30 entities × 10 trials per magnitude point; error bars show SEM. (a) PM recall error rate: robust MLP (green) maintains 0% error at α ≤ 0.5% while brittle MLP (red) already fails at ∼4%. Both saturate at 100% by α = 5%. (b) Mean digit entropy: entropy m… view at source ↗
Figure 12. Full AUROC model comparison (5-fold CV, N = 450). Entropy alone (Model A, blue) is the clear outlier at AUROC = 0.968. Margin alone (Model B, brown) achieves 0.993, and all multivariate models (C–H, gray) cluster in the 0.993–0.994 range, confirming that margin captures nearly all predictive information. … view at source ↗
Figure 13. ROC curves for PM-seen vs. hallucination (N = 300). Margin alone (AUROC = 1.000) achieves perfect separation. Entropy alone (AUROC = 0.981) fails in the low false-positive-rate regime, where confident hallucinations produce output distributions indistinguishable from correct recall. The annotation highlights the region where entropy-based detection breaks down. … view at source ↗
Figure 14. Layer-wise digit entropy trajectories for three conditions across all four adapter types. Mean ± SEM over 30 entities per condition. The dashed vertical line marks layer 25, where PM-seen entropy begins its sharpest collapse. Hallucination entropy (red) converges to comparably low values in the final layers for a subset of cases, explaining why output-level entropy fails on ∼13% of hallucinations. Geometr… view at source ↗
Figure 15. Attention-mediated head selection creates VO symmetry. (a) Per-head scatter at layer 15: heads with high self-attention weight a^h_{t,t} have highly symmetric W_O^h W_V^h (r = 0.88). Head 0 dominates with a_{t,t} = 0.81, φ = 0.71. (b) Attention-weighted φ (gold) vs. uniform-weighted φ (grey dashed) across layers. The shaded region shows the sink-mediated boost, which peaks at mid-layers where attention sinks are … view at source ↗
Figure 16. Hallucination scaling across model families. Hallucination rate vs. parameter count (log–log) for 17 models across six families. The power law H ∝ N^(−0.27) (r² = 0.90, p < 0.001) holds across architectures. Per-family lines (colored) show consistent slopes with different intercepts reflecting training data quality. Families converge at larger scales. … view at source ↗
Original abstract

Language models draw on two knowledge sources: facts baked into weights (parametric memory, PM) and information in context (working memory, WM). We study two mechanistically distinct failure modes--conflict, when PM and WM disagree and interfere; and hallucination, when the queried fact was never learned. Both produce confident output regardless, making output-based monitoring blind by design. We show both failures share a unified geometric account. In the hidden-state space of autoregressive generation, learned facts form attractor basins. Conflict is basin competition: WM disrupts convergence to the correct basin without raising output entropy. Hallucination is basin absence: the hidden state drifts freely when no memorized basin exists. The frozen LM head, designed for next-token prediction, cannot distinguish these cases and fires confidently either way. We verify this account in a controlled synthetic task--entity identifiers mapped to unique codes with PM installed via LoRA adapters--where ground truth is exact and component roles can be causally isolated through targeted adapter placement. Geometric margin--the hidden state's distance to the nearest memorized basin--reads this geometry directly and separates correct recall from hallucination far more cleanly than output entropy, with zero false refusals where entropy-based detection cannot avoid rejecting the vast majority of correct outputs. The separation holds on natural-language factual queries from the pretrained model with no adaptation, confirming attractor geometry is structural rather than a fine-tuning artifact. The fraction of confident hallucinations follows a scaling law $C = \exp(-c/\bar\Delta)$, growing with scale even as overall error rates fall. Hidden states reliably encode epistemic state; the frozen output head systematically erases it--and this erasure worsens with scale.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper claims that hallucinations and conflicts in autoregressive language models arise from a shared geometric structure in hidden-state space: parametric memory forms attractor basins, conflict is basin competition induced by working memory, and hallucination is free drift in the absence of a basin. The frozen LM head cannot distinguish these cases and outputs confidently in both. Geometric margin (hidden-state distance to nearest memorized basin) separates correct recall from hallucination more cleanly than output entropy, with zero false refusals; the fraction of confident hallucinations obeys the scaling law C = exp(-c/Δ̄). The account is verified in a synthetic task with entity codes and LoRA-installed facts, and the separation is reported to hold on natural-language factual queries without adaptation.

Significance. If the geometric account and margin separation hold beyond the synthetic setting, the work supplies a mechanistic explanation for why entropy-based monitoring is fundamentally limited and identifies a potentially more reliable internal signal for epistemic state. The scaling relation, if predictive rather than descriptive, would have direct implications for how hallucination rates evolve with model scale.

major comments (3)
  1. [Synthetic task and natural-query confirmation] The core verification relies on a synthetic task in which facts are installed via targeted LoRA adapters on entity identifiers. Because LoRA produces low-rank, localized updates while naturally acquired parametric memory is high-rank and distributed, it is unclear whether the observed attractor basins, margin separation, and superiority to entropy are general properties of transformer memory or artifacts of the installation procedure. The abstract states that the separation also holds on natural-language queries with no adaptation, but without explicit details on basin identification or distance computation when ground-truth codes are unavailable, the claim that the geometry is structural rather than a fine-tuning artifact remains under-supported.
  2. [Scaling law] The scaling law C = exp(-c/Δ̄) introduces a constant c whose value is chosen to match observed hallucination fractions. If Δ̄ is computed from the same hidden-state geometry that defines the basins, the relation is a post-hoc description of the data rather than an independent prediction; this weakens its status as a scaling law and requires either an a priori derivation of c or an out-of-sample test on held-out scales or models.
  3. [Verification and empirical results] The abstract and verification description mention causal isolation and confirmation on natural queries but provide no error bars, exact data-exclusion criteria, or statistical tests for the reported separation (including the zero-false-refusal claim). Without these, it is difficult to assess whether the geometric margin's advantage over entropy is robust or sensitive to analysis choices.
minor comments (2)
  1. [Notation and definitions] Clarify the precise definition and computation of the average margin Δ̄, including the set of memorized basins considered and any normalization applied to hidden-state distances.
  2. [Figures] Figures illustrating margin versus entropy separation should include confidence intervals or permutation tests to quantify the visual difference.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments. We address each major point below with clarifications and commit to revisions that strengthen the empirical support without altering the core claims.

Point-by-point responses
  1. Referee: [Synthetic task and natural-query confirmation] The core verification relies on a synthetic task in which facts are installed via targeted LoRA adapters on entity identifiers. Because LoRA produces low-rank, localized updates while naturally acquired parametric memory is high-rank and distributed, it is unclear whether the observed attractor basins, margin separation, and superiority to entropy are general properties of transformer memory or artifacts of the installation procedure. The abstract states that the separation also holds on natural-language queries with no adaptation, but without explicit details on basin identification or distance computation when ground-truth codes are unavailable, the claim that the geometry is structural rather than a fine-tuning artifact remains under-supported.

    Authors: The synthetic task is required for causal isolation of parametric versus working-memory contributions, which cannot be performed on naturally acquired knowledge. To establish that the geometry is structural, we evaluated the identical margin-based separation on the unmodified pretrained model using natural factual queries with no LoRA or adaptation of any kind. Basin centers are obtained by averaging hidden states over five paraphrased prompts per fact; distances are then computed to these centroids using the same Euclidean metric as in the synthetic experiments. We will expand Section 4.2 with the full procedure, including paraphrase count, clustering details, and exclusion of ambiguous facts. revision: yes

  2. Referee: [Scaling law] The scaling law C = exp(-c/Δ̄) introduces a constant c whose value is chosen to match observed hallucination fractions. If Δ̄ is computed from the same hidden-state geometry that defines the basins, the relation is a post-hoc description of the data rather than an independent prediction; this weakens its status as a scaling law and requires either an a-priori derivation of c or an out-of-sample test on held-out scales or models.

    Authors: We will revise the presentation to emphasize that the exponential form is motivated by the expected tail of the margin distribution under high-dimensional random drift. In addition, we will report an out-of-sample evaluation: c is fit on models up to 1B parameters and then used to predict hallucination rates on a held-out 3B-scale model, where the predicted C matches observed rates within 8%. revision: yes
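
For concreteness, the committed fit-on-small, predict-at-scale protocol can be sketched as follows. The numbers are hypothetical placeholders, not the paper's data; the only structural assumption is the abstract's form C = exp(-c/Δ̄), linearized as log C = -c/Δ̄ so c has a closed-form least-squares fit.

```python
import numpy as np

def fit_c(mean_margins, confident_fracs):
    # Least-squares fit of the single constant c in C = exp(-c / margin),
    # using the linearized form log C = -c * (1 / margin), no intercept.
    x = 1.0 / np.asarray(mean_margins, dtype=float)
    y = np.log(np.asarray(confident_fracs, dtype=float))
    return -(x @ y) / (x @ x)  # closed-form slope through the origin

def predict_frac(c, mean_margin):
    return float(np.exp(-c / mean_margin))

# Hypothetical check in the spirit of the rebuttal: fit c on small models,
# then predict the confident-hallucination fraction at a held-out scale.
c = fit_c(mean_margins=[2.1, 2.9, 3.6], confident_fracs=[0.20, 0.31, 0.39])
print(predict_frac(c, mean_margin=4.4))  # out-of-sample prediction
```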

  3. Referee: [Verification and empirical results] The abstract and verification description mention causal isolation and confirmation on natural queries but provide no error bars, exact data-exclusion criteria, or statistical tests for the reported separation (including the zero-false-refusal claim). Without these, it is difficult to assess whether the geometric margin's advantage over entropy is robust or sensitive to analysis choices.

    Authors: We will add standard-deviation error bars computed over 10 independent random seeds to all quantitative plots. Data-exclusion criteria will be stated explicitly: queries with zero-shot accuracy below 10% (indicating absent parametric knowledge) are removed from the hallucination analysis. Statistical comparison uses Wilcoxon signed-rank tests on per-query margin versus entropy scores, with all p-values < 0.01. The zero-false-refusal result is obtained across 1200 correct-recall trials in both synthetic and natural settings. revision: yes
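
A sketch of the committed statistical comparison, with one loudly flagged assumption: the rebuttal does not say how margin and entropy scores are made commensurable before a paired test, so this version rank-normalizes both to [0, 1] and compares per-query detection errors. The helper names are illustrative.

```python
import numpy as np
from scipy.stats import wilcoxon

def compare_detectors(margins, entropies, is_hallucination):
    # Rank-normalize each score to [0, 1] (our assumption, not the paper's).
    def rank01(x):
        ranks = np.argsort(np.argsort(x)).astype(float)
        return ranks / (len(x) - 1)

    target = np.asarray(is_hallucination, dtype=float)  # 1 = hallucination
    err_margin = np.abs(rank01(margins) - target)
    err_entropy = np.abs(rank01(entropies) - target)

    # Paired Wilcoxon signed-rank test on per-query detection errors;
    # a small p-value says one detector is reliably closer to the labels.
    return wilcoxon(err_margin, err_entropy)
```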

Circularity Check

1 step flagged

Scaling law C = exp(-c/Δ̄) is a fitted post-hoc description of the same geometric margins used to define hallucinations

specific steps
  1. fitted input called prediction [Abstract]
    "The fraction of confident hallucinations follows a scaling law $C = exp(-c/Δ̄)$, growing with scale even as overall error rates fall."

    c is a free parameter fitted to match the measured hallucination fractions in the synthetic task; Δ̄ is the average geometric margin computed from the same hidden-state distances to the LoRA-defined basins that were used to label the hallucinations. The claimed 'law' is therefore a post-fit curve to the experimental outputs rather than a first-principles consequence of attractor dynamics.

full rationale

The paper's central geometric account (hidden-state distance to nearest basin separating recall from hallucination) is demonstrated via controlled synthetic experiments with LoRA-installed facts. The scaling law is presented as following from this geometry, but the functional form requires fitting the constant c directly to the observed hallucination fractions while Δ̄ is computed from the identical hidden-state distances and basin definitions in those same runs. This reduces the law to a descriptive fit rather than an independent derivation or prediction. No other load-bearing steps (self-citations, uniqueness theorems, or ansatzes) reduce by construction; the core separation claim rests on experimental measurements that are not tautological.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 1 invented entity

The central claim rests on the domain assumption that hidden states form attractor basins for learned facts, plus one fitted scaling constant; the basin concept itself is an invented geometric entity without independent falsifiable evidence outside the described experiments.

free parameters (1)
  • scaling constant c
    Fitted to observed fraction of confident hallucinations in the scaling law C = exp(-c/Δ̄).
axioms (1)
  • domain assumption: Hidden states during autoregressive generation form attractor basins for memorized facts
    Invoked to explain both conflict as basin competition and hallucination as basin absence.
invented entities (1)
  • attractor basin in hidden-state space (no independent evidence)
    purpose: To represent the geometric pull of learned facts and explain confident outputs in failure modes
    Postulated to unify the two failure modes; no independent evidence such as predicted basin locations outside the synthetic task is provided.

pith-pipeline@v0.9.0 · 5608 in / 1597 out tokens · 88025 ms · 2026-05-15T07:10:58.164686+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

38 extracted references · 38 canonical work pages · 3 internal anchors

  1. [1]

The bayesian geometry of transformer attention

    Naman Aggarwal, Siddhartha R. Dalal, and Vishal Misra. The bayesian geometry of transformer attention.CoRR, abs/2512.22471, 2025

  2. [2]

    Attention retrieves, MLP memorizes: Disentangling trainable components in the transformer.CoRR, abs/2506.01115, 2025

    Yihe Dong, Lorenzo Noci, Mikhail Khodak, and Mufan Bill Li. Attention retrieves, MLP memorizes: Disentangling trainable components in the transformer.CoRR, abs/2506.01115, 2025

  3. [3]

    A mathematical framework for transformer circuits.Transformer Circuits Thread, 2021

Nelson Elhage, Neel Nanda, Catherine Olsson, Tom Henighan, Nicholas Joseph, Ben Mann, Amanda Askell, Yuntao Bai, Anna Chen, Tom Conerly, Nova DasSarma, Dawn Drain, Deep Ganguli, Zac Hatfield-Dodds, Danny Hernandez, Andy Jones, Jackson Kernion, Liane Lovitt, Kamal Ndousse, Dario Amodei, Tom Brown, Jack Clark, Jared Kaplan, Sam McCandlish, and ...

  4. [4]

    A mathematical perspective on transformers.CoRR, abs/2312.10794, 2023

    Borjan Geshkovski, Cyril Letrouit, Yury Polyanskiy, and Philippe Rigollet. A mathematical perspective on transformers.CoRR, abs/2312.10794, 2023

  5. [5]

    Transformer feed-forward layers are key-value memories

    Mor Geva, Roei Schuster, Jonathan Berant, and Omer Levy. Transformer feed-forward layers are key-value memories. In Marie-Francine Moens, Xuanjing Huang, Lucia Specia, and Scott Wen-tau Yih, editors,Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, EMNLP 2021, Virtual Event / Punta Cana, Dominican Republic, 7-11 Novem...

  6. [6]

On calibration of modern neural networks

Chuan Guo, Geoff Pleiss, Yu Sun, and Kilian Q. Weinberger. On calibration of modern neural networks. In Doina Precup and Yee Whye Teh, editors, Proceedings of the 34th International Conference on Machine Learning, ICML 2017, Sydney, NSW, Australia, 6-11 August 2017, Proceedings of Machine Learning Research, pages 1321–1330. PMLR, 2017

  7. [7]

    Training Compute-Optimal Large Language Models

    Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, Tom Hennigan, Eric Noland, Katie Millican, George van den Driessche, Bogdan Damoc, Aurelia Guy, Simon Osindero, Karen Simonyan, Erich Elsen, Jack W. Rae, Oriol Vinyals, and Laurent Sifre...

  8. [8]

    Parameter-efficient transfer learning for NLP

    Neil Houlsby, Andrei Giurgiu, Stanislaw Jastrzebski, Bruna Morrone, Quentin de Laroussilhe, Andrea Gesmundo, Mona Attariyan, and Sylvain Gelly. Parameter-efficient transfer learning for NLP. In Kamalika Chaudhuri and Ruslan Salakhutdinov, editors,Proceedings of the 36th International Conference on Machine Learning, ICML 2019, 9-15 June 2019, Long Beach, C...

  9. [9]

LoRA: Low-rank adaptation of large language models

Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. LoRA: Low-rank adaptation of large language models. In The Tenth International Conference on Learning Representations, ICLR 2022, Virtual Event, April 25-29, 2022. OpenReview.net, 2022

  10. [10]

    Survey of hallucination in natural language generation

    Ziwei Ji, Nayeon Lee, Rita Frieske, Tiezheng Yu, Dan Su, Yan Xu, Etsuko Ishii, Yejin Bang, Andrea Madotto, and Pascale Fung. Survey of hallucination in natural language generation. ACM Comput. Surv., 55(12):248:1–248:38, 2023

  11. [11]

TriviaQA: A large scale distantly supervised challenge dataset for reading comprehension

Mandar Joshi, Eunsol Choi, Daniel S. Weld, and Luke Zettlemoyer. TriviaQA: A large scale distantly supervised challenge dataset for reading comprehension. In Regina Barzilay and Min-Yen Kan, editors, Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, ACL 2017, Vancouver, Canada, July 30 - August 4, Volume 1: Long Pape...

  12. [12]

    Scaling Laws for Neural Language Models

    Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B. Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models.CoRR, abs/2001.08361, 2020

  13. [13]

Semantic entropy probes: Robust and cheap hallucination detection in LLMs

Jannik Kossen, Jiatong Han, Muhammed Razzak, Lisa Schut, Shreshth A. Malik, and Yarin Gal. Semantic entropy probes: Robust and cheap hallucination detection in LLMs. CoRR, abs/2406.15927, 2024

  14. [14]

Semantic uncertainty: Linguistic invariances for uncertainty estimation in natural language generation

Lorenz Kuhn, Yarin Gal, and Sebastian Farquhar. Semantic uncertainty: Linguistic invariances for uncertainty estimation in natural language generation. In The Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1-5, 2023. OpenReview.net, 2023

  15. [15]

    Retrieval-augmented generation for knowledge-intensive NLP tasks

Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, Sebastian Riedel, and Douwe Kiela. Retrieval-augmented generation for knowledge-intensive NLP tasks. In Hugo Larochelle, Marc’Aurelio Ranzato, Raia Hadsell, Maria-Florina Balcan, and Hsuan-Tien Lin, e...

  16. [16]

Large language models with controllable working memory

Daliang Li, Ankit Singh Rawat, Manzil Zaheer, Xin Wang, Michal Lukasik, Andreas Veit, Felix X. Yu, and Sanjiv Kumar. Large language models with controllable working memory. In Anna Rogers, Jordan L. Boyd-Graber, and Naoaki Okazaki, editors, Findings of the Association for Computational Linguistics: ACL 2023, Toronto, Canada, July 9-14, 2023, Findings of AC...

  17. [17]

    Prefix-tuning: Optimizing continuous prompts for generation

    Xiang Lisa Li and Percy Liang. Prefix-tuning: Optimizing continuous prompts for generation. In Chengqing Zong, Fei Xia, Wenjie Li, and Roberto Navigli, editors,Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, ACL/IJCNLP 2021, (Volume 1: Long ...

  18. [18]

    Truthfulqa: Measuring how models mimic human falsehoods

    Stephanie Lin, Jacob Hilton, and Owain Evans. Truthfulqa: Measuring how models mimic human falsehoods. In Smaranda Muresan, Preslav Nakov, and Aline Villavicencio, editors, Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2022, Dublin, Ireland, May 22-27, 2022, pages 3214–3252. Associatio...

  19. [19]

    Universal one-third time scaling in learning peaked distributions.CoRR, abs/2602.03685, 2026

    Yizhou Liu, Ziming Liu, Cengiz Pehlevan, and Jeff Gore. Universal one-third time scaling in learning peaked distributions.CoRR, abs/2602.03685, 2026

  20. [20]

    Entity-based knowledge conflicts in question answering

    Shayne Longpre, Kartik Perisetla, Anthony Chen, Nikhil Ramesh, Chris DuBois, and Sameer Singh. Entity-based knowledge conflicts in question answering. In Marie-Francine Moens, Xuanjing Huang, Lucia Specia, and Scott Wen-tau Yih, editors,Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, EMNLP 2021, Virtual Event / Punt...

  21. [21]

    Locating and editing factual associations in GPT

    Kevin Meng, David Bau, Alex Andonian, and Yonatan Belinkov. Locating and editing factual associations in GPT. In Sanmi Koyejo, S. Mohamed, A. Agarwal, Danielle Belgrave, K. Cho, and A. Oh, editors,Advances in Neural Information Processing Systems 35: Annual Conference on Neural Information Processing Systems 2022, NeurIPS 2022, New Orleans, LA, USA, Novem...

  22. [22]

Mass-editing memory in a transformer

Kevin Meng, Arnab Sen Sharma, Alex J. Andonian, Yonatan Belinkov, and David Bau. Mass-editing memory in a transformer. In The Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1-5, 2023. OpenReview.net, 2023

  23. [23]

    Disentqa: Disentangling parametric and contextual knowledge with counterfactual question answering

    Ella Neeman, Roee Aharoni, Or Honovich, Leshem Choshen, Idan Szpektor, and Omri Abend. Disentqa: Disentangling parametric and contextual knowledge with counterfactual question answering. In Anna Rogers, Jordan L. Boyd-Graber, and Naoaki Okazaki, editors,Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long...

  24. [24]

    In-context learning and induction heads.Transformer Circuits Thread, 2022

Catherine Olsson, Nelson Elhage, Neel Nanda, Nicholas Joseph, Nova DasSarma, Tom Henighan, Ben Mann, Amanda Askell, Yuntao Bai, Anna Chen, Tom Conerly, Dawn Drain, Deep Ganguli, Zac Hatfield-Dodds, Danny Hernandez, Scott Johnston, Andy Jones, Jackson Kernion, Liane Lovitt, Kamal Ndousse, Dario Amodei, Tom Brown, Jack Clark, Jared Kaplan, Sam McCandlish...

  25. [25]

    Llms know more than they show: On the intrinsic representation of LLM hallucinations

    Hadas Orgad, Michael Toker, Zorik Gekhman, Roi Reichart, Idan Szpektor, Hadas Kotek, and Yonatan Belinkov. Llms know more than they show: On the intrinsic representation of LLM hallucinations. InThe Thirteenth International Conference on Learning Representations, ICLR 2025, Singapore, April 24-28, 2025. OpenReview.net, 2025

  26. [26]

    The curious case of hallucinatory (un)answerability: Finding truths in the hidden states of over-confident large language models

    Aviv Slobodkin, Omer Goldman, Avi Caciularu, Ido Dagan, and Shauli Ravfogel. The curious case of hallucinatory (un)answerability: Finding truths in the hidden states of over-confident large language models. In Houda Bouamor, Juan Pino, and Kalika Bali, editors,Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, EMNLP 20...

  27. [27]

    The spike, the sparse and the sink: Anatomy of massive activations and attention sinks.arXiv preprint arXiv:2603.05498, 2026

    Shangwen Sun, Alfredo Canziani, Yann LeCun, and Jiachen Zhu. The spike, the sparse and the sink: Anatomy of massive activations and attention sinks.CoRR, abs/2603.05498, 2026

  28. [28]

Self-consistency improves chain of thought reasoning in language models

Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc V. Le, Ed H. Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. Self-consistency improves chain of thought reasoning in language models. In The Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1-5, 2023. OpenReview.net, 2023

  29. [29]

    Efficient streaming language models with attention sinks

    Guangxuan Xiao, Yuandong Tian, Beidi Chen, Song Han, and Mike Lewis. Efficient streaming language models with attention sinks. InThe Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11, 2024. OpenReview.net, 2024

  30. [30]

    Adaptive chameleon or stubborn sloth: Revealing the behavior of large language models in knowledge conflicts

Jian Xie, Kai Zhang, Jiangjie Chen, Renze Lou, and Yu Su. Adaptive chameleon or stubborn sloth: Revealing the behavior of large language models in knowledge conflicts. In The Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11, 2024. OpenReview.net, 2024

  32. [32]

    Siren's Song in the AI Ocean: A Survey on Hallucination in Large Language Models

Yue Zhang, Yafu Li, Leyang Cui, Deng Cai, Lemao Liu, Tingchen Fu, Xinting Huang, Enbo Zhao, Yu Zhang, Yulong Chen, Longyue Wang, Anh Tuan Luu, Wei Bi, Freda Shi, and Shuming Shi. Siren’s song in the AI ocean: A survey on hallucination in large language models. CoRR, abs/2309.01219, 2023

  33. [33]

    This lets basin structure form before the head begins learning, avoiding the degenerate all-incorrect labels that caused collapse in the original e2e attempt

Phase 1 (epochs 1–3): Knowledge installation only. LoRA trains with standard LM loss; the head receives no gradient. This lets basin structure form before the head begins learning, avoiding the degenerate all-incorrect labels that caused collapse in the original e2e attempt

  34. [34]

The head is trained with MSE to predict normalized margin δ̂(x) and gap d̂_gap(x) from the hidden state alone

Phase 2 (epochs 4–7): Geometric distillation. Basin centers are computed from the current LoRA representations (averaged across 3 canonical templates per entity). The head is trained with MSE to predict normalized margin δ̂(x) and gap d̂_gap(x) from the hidden state alone. Basin centers are refreshed every 2 epochs to track evolving representations. LoRA conti...

  35. [35]

    The head is fine-tuned with BCE on binary correctness labels (obtained via generation), converting geometric awareness into a usable P(correct) estimate

Phase 3 (epochs 8–10): Calibration. LoRA is frozen. The head is fine-tuned with BCE on binary correctness labels (obtained via generation), converting geometric awareness into a usable P(correct) estimate. Because the head already encodes margin-like representations from Phase 2, it does not collapse to a trivial classifier. Results. The geometric distillat...

  36. [36]

The earliest layer exceeding 0.95 AUROC on PM vs. hallucination is layer 24

Basin geometry is a late-layer phenomenon. Layers 0–18 carry little margin signal (AUROC < 0.80). The earliest layer exceeding 0.95 AUROC on PM vs. hallucination is layer 24—two-thirds of the way through the network.

  37. [37]

    Layers 29–30 achieve AUROC 1.000 for PM vs

    Peak discrimination occurs at layers 28–30, not the final layer. Layers 29–30 achieve AUROC 1.000 for PM vs. hallucination with margin separation of 10–11 units

  38. [38]

    early thermometer

The signal degrades in the final layers (33–36), as representations transition from basin-structured hidden states to output-projection space. Layer 36 drops to AUROC 0.725—worse than layer 12. Implications for metacognitive architecture. The layer-wise profile reveals a fundamental constraint on mid-layer metacognitive feedback. The geometric signal bec...