arXiv: 2606.07559 · v1 · pith:H3N23R2Rnew · submitted 2026-05-25 · 💻 cs.CL · cs.AI · quant-ph

Phantom transitions in language model fine-tuning

Pith reviewed 2026-06-29 21:52 UTC · model grok-4.3

classification 💻 cs.CL · cs.AI · quant-ph
keywords near-synonym fine-tuning · phantom transitions · softmax discontinuity · order parameter · LoRA adaptation · dimensionless quantities · kinematic failure · structural failure

The pith

The apparent phase transitions in language model fine-tuning on near-synonym tasks are artifacts that live only in the softmax readout.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper studies fine-tuning failures where loss falls but the correct token fails to overtake a near-synonym competitor in rank. It defines an order parameter that splits the dynamics into a signal measuring commitment to the correct token and a drag term from embedding overlaps. Sharp jumps in this parameter resemble phase transitions yet persist when the embedding matrix is held fixed under LoRA training. A handful of dimensionless quantities derived from the order parameter organize trajectories across five architectures and predict the critical learning rate for a held-out model to within 2.1 percent.

Core claim

The transitions are phantoms. The spontaneous-symmetry-breaking interpretation is ruled out by direct measurement. Catapult-like jumps still appear under LoRA fine-tuning with the token embedding matrix exactly unchanged during training, where no geometric phase transition is possible. The discontinuity lives entirely in the softmax readout. A small number of dimensionless quantities organise the trajectory across architectures. One is consistent across all five under full fine-tuning. A second sorts architectures into two classes by bulk embedding distribution and predicts LoRA sufficiency. As a blind test, the framework predicts the critical learning rate of a held-out architecture to within 2.1%.
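A toy illustration of how a readout can look discontinuous while everything upstream is smooth. This is a minimal sketch, not the paper's order parameter: the two logit trajectories below are hypothetical, and the only point is that top-1 rank is a step function of a logit gap that changes by the same amount every step.

```python
import numpy as np

def softmax(z):
    z = z - np.max(z)          # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

steps = np.arange(20)
logit_g = 1.0 + 0.2 * steps            # correct token: smooth linear rise (hypothetical)
logit_c = np.full(steps.shape, 3.1)    # competitor: constant (hypothetical)

gap = logit_g - logit_c                # smooth quantity upstream of the readout
p_g = np.array([softmax(np.array([g, c]))[0] for g, c in zip(logit_g, logit_c)])
top1_is_g = p_g > 0.5                  # rank readout: True once g outranks c
flip_step = int(np.argmax(top1_is_g))  # first step at which the rank flips

print("per-step change in logit gap:", np.unique(np.round(np.diff(gap), 6)))
print("flip step:", flip_step)
```

The logit gap never jumps, yet the rank readout changes at exactly one step. Nothing here involves the embedding matrix moving.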

What carries the argument

An order parameter that combines the predicted distribution with pairwise embedding overlaps and decomposes additively into a signal and a background drag.

If this is right

  • Catapult jumps persist when the embedding matrix remains exactly unchanged.
  • Architectures divide into two classes by bulk embedding distribution that predicts whether LoRA alone suffices.
  • One dimensionless quantity remains consistent across all tested models under full fine-tuning.
  • The same framework predicts critical learning rates for unseen architectures to within 2.1 percent.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Monitoring the signal-to-drag ratio during training could flag impending structural failures before loss plateaus.
  • Targeted adjustments to the softmax temperature or normalization might eliminate the phantom jumps without touching embeddings.
  • The decomposition into signal and drag may generalize to other probability outputs in which closely related candidates compete for rank.
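For the temperature suggestion in the list above, it may help to see what a global softmax temperature does and does not change. The sketch below uses hypothetical logits for a correct token, a competitor, and one bulk token. It is not drawn from the paper.

```python
import numpy as np

def softmax(z, temperature=1.0):
    z = np.asarray(z, dtype=float) / temperature
    z = z - z.max()            # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

# Hypothetical logits: [correct token, near-synonym competitor, bulk token]
logits = np.array([2.0, 2.3, 0.5])

results = {}
for T in (0.5, 1.0, 2.0):
    p = softmax(logits, T)
    results[T] = p
    print(f"T={T}: p={np.round(p, 3)}, top-1 index={int(p.argmax())}")
```

Dividing logits by a positive temperature changes how sharply probability concentrates but leaves the ordering of tokens unchanged. Whether that is enough to remove jumps in the paper's order parameter is not something this sketch tests.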

Load-bearing premise

The ten hand-selected near-synonym contexts represent the general near-synonym failure regime and the order parameter captures the relevant dynamics without omitting other factors.

What would settle it

Repeating the LoRA runs with fixed embeddings on a fresh set of near-synonym contexts and checking whether the order-parameter jumps disappear.
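A replication of this kind depends on confirming that the embedding matrix really is bit-for-bit unchanged. The following is a minimal NumPy sketch of that check under stated assumptions: a single adapted linear layer, a readout tied to the embedding matrix, hand-written gradients, and hypothetical token ids and hyperparameters. It illustrates the LoRA update structure (frozen base weight plus a trainable low-rank product) and is not the authors' training setup.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab, d, rank, alpha, lr = 50, 16, 4, 8.0, 0.05   # hypothetical sizes
scale = alpha / rank

E = rng.normal(size=(vocab, d))            # token embeddings: frozen
W = rng.normal(size=(d, d)) / np.sqrt(d)   # base weight: frozen
A = rng.normal(size=(rank, d)) * 0.01      # LoRA down-projection: trainable
B = np.zeros((d, rank))                    # LoRA up-projection: trainable, zero init

E_before = E.copy()
context_tok, target_tok = 3, 7             # hypothetical token ids

def forward(A, B):
    x = E[context_tok]
    h = (W + scale * B @ A) @ x            # adapted layer: W + (alpha/r) * B A
    z = E @ h                              # readout tied to E (an assumption)
    p = np.exp(z - z.max())
    p /= p.sum()
    return x, p

losses = []
for _ in range(20):
    x, p = forward(A, B)
    losses.append(-np.log(p[target_tok]))  # cross-entropy on the target token
    dz = p.copy()
    dz[target_tok] -= 1.0                  # d(loss)/d(logits)
    dh = E.T @ dz                          # gradient flows through E, but E is never updated
    dW = np.outer(dh, x)                   # gradient w.r.t. the effective weight
    grad_B = scale * dW @ A.T
    grad_A = scale * B.T @ dW
    B -= lr * grad_B                       # only the adapter matrices move
    A -= lr * grad_A

embeddings_unchanged = np.array_equal(E, E_before)
adapter_moved = not np.allclose(B, 0.0)
print("embeddings unchanged:", embeddings_unchanged, "| adapter moved:", adapter_moved)
```

Using `np.array_equal` rather than a tolerance-based comparison matches the "exactly unchanged" condition: any update to the embeddings, however small, would fail the check.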

Figures

Figures reproduced from arXiv: 2606.07559 by Jayasri Dontabhaktuni, Vaibhav Prakash.

Figure 1: Purity and Participation Ratio across three scenarios. Left: all probability on the correct token, purity = 1, PR = 1. Centre: uniform probability over four orthogonal tokens, purity = 0.25, PR = 4. Right: equal probability on g and a near-synonym competitor c with G_gc = 0.85, purity = 0.86, PR = 1.16 instead of 2. The geometric coupling inflates purity and artificially reduces PR. The model looks almost c…
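The three scenarios in the Figure 1 caption can be reproduced numerically. The sketch below assumes purity = Σ_ij p_i p_j G_ij² with PR = 1/purity, where p is the predicted distribution and G the matrix of pairwise embedding overlaps. That formula is inferred from the caption's numbers, not quoted from the paper, and the function names are illustrative.

```python
import numpy as np

def purity(p, G):
    """Assumed form: sum_ij p_i p_j G_ij**2 (inferred from the caption values)."""
    p = np.asarray(p, dtype=float)
    G = np.asarray(G, dtype=float)
    return float(p @ (G ** 2) @ p)

def participation_ratio(p, G):
    """Assumed form: PR = 1 / purity."""
    return 1.0 / purity(p, G)

# Left: all probability on the correct token.
one_hot = purity([1.0], [[1.0]])

# Centre: uniform probability over four orthogonal tokens (G = identity).
uniform4 = purity(np.full(4, 0.25), np.eye(4))

# Right: equal probability on g and a competitor c with overlap 0.85.
G_pair = np.array([[1.0, 0.85],
                   [0.85, 1.0]])
pair = purity([0.5, 0.5], G_pair)
pair_pr = participation_ratio([0.5, 0.5], G_pair)

print(round(one_hot, 2), round(uniform4, 2), round(pair, 2))  # 1.0 0.25 0.86
print(round(pair_pr, 2))                                       # 1.16
```

Under this assumed form, the off-diagonal overlap adds 2 × 0.25 × 0.85² ≈ 0.36 to the purity of the two-token case, which is why PR comes out near 1.16 rather than 2.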
Figure 2: The discrimination criterion and burial depth. The localisation length ξ is the angular radius of the prediction cloud on the unit embedding sphere. ξ > arccos(G_max) (left, B > 1) means g and c both fall inside the cloud and the model is geometrically unresolved. Fine-tuning must physically compress the cloud to B < 1 (right) before Born scoring can resolve the competition. Pulling together the kinetics of…
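The Figure 2 criterion can be written as a one-line check. The caption only states that ξ > arccos(G_max) corresponds to B > 1. The ratio form B = ξ / arccos(G_max) used below is one definition consistent with that statement and is an assumption, as are the example values of ξ.

```python
import math

def burial_depth(xi, g_max):
    """Hypothetical burial depth: angular cloud radius over the g-c angular separation."""
    if not -1.0 < g_max < 1.0:
        raise ValueError("g_max must lie strictly between -1 and 1")
    return xi / math.acos(g_max)

def is_resolved(xi, g_max):
    """True when the prediction cloud is tighter than the angle between g and c."""
    return burial_depth(xi, g_max) < 1.0

# Illustrative values only: overlap 0.85 gives an angular separation of about 0.555 rad.
wide_cloud = burial_depth(0.8, 0.85)    # cloud wider than the separation
tight_cloud = burial_depth(0.3, 0.85)   # cloud narrower than the separation
print(is_resolved(0.8, 0.85), is_resolved(0.3, 0.85))
```

A higher overlap G_max shrinks arccos(G_max), so the same cloud radius ξ yields a larger B. This is the sense in which closer near-synonyms are harder to resolve under the stated criterion.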
Figure 3: Step-level Born-gap dynamics on SmolLM-360M under full FT (η = 2 × 10⁻⁵). High-G_max sentences (top, longing) show 9 quiescent steps then a single jump (ratio 33.7). Low-G_max sentences (bottom, purpose) rise continuously from step 1 (ratio 4.3). Green dashed lines mark the flip step. A learning-rate sweep (η across 5 values, all 10 sentences on SmolLM-360M, 50 combinations) is cleanly separated by a single empir…
Figure 4: Causal isolation (DM-5d). Freezing the embedding matrix under LoRA converts all sharp jumps into smooth trajectories and eliminates the predictive power of G_max.
Figure 5: Hidden-state alignment under full FT: per-step ψ_g · h (green), ψ_c · h (red), and the logit gap (dashed black). The purple line marks the Φ flip step. All three curves evolve smoothly through the flip. The discontinuity resides entirely in the softmax readout. 5.1 Signal and Background Drag: Two Types of Failure. The SmolLM-360M experiments resolve every sentence. The two failure types of §1, kinematic (drag i…
Figure 6: Universality class is determined by bulk embedding geometry, not parameter count. Left: Class A (distilgpt2 82M, GPT-2-medium 345M, SmolLM-360M) with dense Gaussian bulk, G²_mean ∈ [0.045, 0.097]. Right: Class B (Pythia-70M, Pythia-410M) with sparse exponential bulk, G²_mean ∈ [0.002, 0.003]. Pythia-70M (70M) is Class B while distilgpt2 (82M) is Class A. rate places every model in the sample at the same di…
Figure 7: H_FT is consistent within our sample of five architectures (CV = 9.3%, n = 5) at η = 2 × 10⁻⁵. All five models sit at H ≈ 10, roughly ten times above their own fitted saturation threshold, despite spanning two families, two geometry classes, and a 5× parameter range. This is an empirical regularity in our sample, not a proven physical law. Validation on additional architecture families is needed. two reading…
Figure 8: Cross-model laws linking geometry to LoRA sufficiency. Left: B_FT correlates with bulk embedding density (Spearman ρ = +0.90, p = 0.037). Denser bulk gives deeper burial. Right: B_FT anti-correlates with the reduced field H (Spearman ρ = −0.90, p = 0.037). More headroom above the saturation threshold gives shallower burial. Bulk geometry determines both θ* and B. 7.1 Scope of the Experimental Dataset. The te…
Original abstract

Fine-tuning a language model on contexts whose correct completion has a near-synonym competitor often fails silently. The cross-entropy loss decreases monotonically while the correct token never overtakes the competitor in rank. We study this regime across five transformer architectures spanning two families and a fivefold parameter range, on ten hand-selected near-synonym contexts. We instrument these failures with an order parameter combining the predicted distribution and pairwise embedding overlaps. It decomposes additively into a signal, tracking the model's commitment to the correct token over its nearest competitor, and a background drag, set by how the embedding bulk leaks probability into the score. This isolates two failure modes. In kinematic failure the signal stays small. In structural failure the drag actively worsens as fine-tuning proceeds. We observe sharp catapult-like jumps in the order parameter that resemble a phase transition. A central negative result organises the paper. The transitions are phantoms. The spontaneous-symmetry-breaking interpretation is ruled out by direct measurement. Catapult-like jumps still appear under LoRA fine-tuning with the token embedding matrix exactly unchanged during training, where no geometric phase transition is possible. The discontinuity lives entirely in the softmax readout. A small number of dimensionless quantities organise the trajectory across architectures. One is consistent across all five under full fine-tuning. A second sorts architectures into two classes by bulk embedding distribution and predicts LoRA sufficiency. As a blind test, the framework predicts the critical learning rate of a held-out architecture, not used to fit any parameter, to within 2.1% of a subsequent learning-rate sweep. Findings concern the near-synonym mechanism only and should not be extrapolated without recalibration.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 3 minor

Summary. The paper examines silent failures during fine-tuning of language models on near-synonym contexts, where cross-entropy loss decreases but the correct token fails to overtake its competitor in rank. Across five transformer architectures (two families, fivefold parameter range) and ten hand-selected contexts, the authors introduce an order parameter that decomposes additively into a signal component (commitment to the correct token) and background drag (embedding bulk leakage). They report catapult-like jumps in this order parameter but demonstrate via LoRA fine-tuning with the token embedding matrix held exactly fixed that these are phantom transitions occurring entirely in the softmax readout, ruling out spontaneous symmetry breaking in embedding geometry. Dimensionless quantities organize trajectories across architectures, one consistent under full fine-tuning and another sorting by bulk embedding distribution to predict LoRA sufficiency; a blind test predicts the critical learning rate of a held-out architecture to within 2.1%.

Significance. If the central negative result holds, the work supplies a mechanistic account of a common fine-tuning failure mode, isolating the discontinuity to the readout and providing non-circular support through the fixed-embedding LoRA condition, cross-architecture consistency of dimensionless quantities, and successful blind prediction. These elements (multi-architecture span, direct test of the geometric hypothesis, and out-of-sample prediction) strengthen the phantom diagnosis for the measured cases and could guide targeted interventions in near-synonym regimes.

minor comments (3)
  1. [Abstract / Experiments] The abstract and methods should explicitly state the selection criteria or sampling procedure for the ten near-synonym contexts to support reproducibility and allow readers to assess representativeness of the failure regime.
  2. [Order parameter section] The order parameter is described as combining predicted distribution and pairwise embedding overlaps; formalizing its exact definition (including any normalization or weighting) with an equation in the main text would improve clarity.
  3. [Results figures] Figures reporting the order parameter trajectories and learning-rate sweeps should include explicit discussion of error bars, exclusion criteria, and statistical significance of the observed jumps.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their careful summary of the work and for the positive recommendation to accept. The report correctly identifies the central negative result (phantom transitions confined to the softmax readout) and the supporting elements (fixed-embedding LoRA controls, cross-architecture dimensionless quantities, and blind prediction). No major comments were raised that require response or revision.

Circularity Check

0 steps flagged

No significant circularity

full rationale

The paper's central claims rest on direct experimental measurements (LoRA runs with token embeddings held exactly fixed, ruling out geometric phase transitions) and a blind prediction of critical learning rate on a held-out architecture (accurate to 2.1% with no fitting to that model). The order parameter, its signal/background decomposition, and the reported dimensionless quantities are constructed from observable quantities and shown consistent across architectures without reducing to the target results by definition or by self-citation chains. No load-bearing step equates a prediction to its own fitted input or imports uniqueness from prior author work.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 3 invented entities

The central claim rests on the additive decomposition of the order parameter and the assumption that the ten contexts suffice to expose the general mechanism; the dimensionless quantities are presented as organizing the data rather than being derived from first principles.

free parameters (1)
  • dimensionless quantities organizing trajectory
    A small number of dimensionless quantities are reported to organize behavior across architectures; one is consistent under full fine-tuning and a second sorts architectures for LoRA prediction.
axioms (1)
  • domain assumption: The order parameter decomposes additively into signal (commitment to correct token) and background drag (embedding bulk leakage)
    Introduced in the abstract as the instrument for isolating kinematic versus structural failure modes.
invented entities (3)
  • order parameter (no independent evidence)
    purpose: Combines predicted distribution and pairwise embedding overlaps to track fine-tuning progress
    Newly defined to instrument the silent failures.
  • signal component (no independent evidence)
    purpose: Tracks model's commitment to correct token over nearest competitor
    Part of the additive decomposition.
  • background drag component (no independent evidence)
    purpose: Captures probability leakage from embedding bulk into the score
    Part of the additive decomposition.

pith-pipeline@v0.9.1-grok · 5828 in / 1542 out tokens · 36351 ms · 2026-06-29T21:52:21.101116+00:00 · methodology

