Phantom transitions in language model fine-tuning

Jayasri Dontabhaktuni; Vaibhav Prakash

arXiv: 2606.07559 · v1 · pith:H3N23R2Rnew · submitted 2026-05-25 · 💻 cs.CL · cs.AI · quant-ph

Pith reviewed 2026-06-29 21:52 UTC · model grok-4.3

keywords: near-synonym fine-tuning · phantom transitions · softmax discontinuity · order parameter · LoRA adaptation · dimensionless quantities · kinematic failure · structural failure

The pith

The apparent phase transitions in language model fine-tuning on near-synonym tasks are artifacts that live only in the softmax readout.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper studies fine-tuning failures where loss falls but the correct token fails to overtake a near-synonym competitor in rank. It defines an order parameter that splits the dynamics into a signal measuring commitment to the correct token and a drag term from embedding overlaps. Sharp jumps in this parameter resemble phase transitions yet persist when the embedding matrix is held fixed under LoRA training. A handful of dimensionless quantities derived from the order parameter organize trajectories across five architectures and predict the critical learning rate for a held-out model to within 2.1 percent.

Core claim

The transitions are phantoms. The spontaneous-symmetry-breaking interpretation is ruled out by direct measurement. Catapult-like jumps still appear under LoRA fine-tuning with the token embedding matrix exactly unchanged during training, where no geometric phase transition is possible. The discontinuity lives entirely in the softmax readout. A small number of dimensionless quantities organise the trajectory across architectures. One is consistent across all five under full fine-tuning. A second sorts architectures into two classes by bulk embedding distribution and predicts LoRA sufficiency. As a blind test, the framework predicts the critical learning rate of a held-out architecture to within 2.1% of a subsequent learning-rate sweep.

What carries the argument

The order parameter that combines predicted distribution with pairwise embedding overlaps and decomposes additively into signal and background drag.

If this is right

Catapult jumps persist when the embedding matrix remains exactly unchanged.
Architectures divide into two classes by bulk embedding distribution, and the class predicts whether LoRA alone suffices.
One dimensionless quantity remains consistent across all tested models under full fine-tuning.
The same framework predicts critical learning rates for unseen architectures to within 2.1 percent.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Monitoring the signal-to-drag ratio during training could flag impending structural failures before loss plateaus.
Targeted adjustments to the softmax temperature or normalization might eliminate the phantom jumps without touching embeddings.
The decomposition into signal and drag may generalize to other probabilistic outputs where closely related candidates compete in rank.

Load-bearing premise

The ten hand-selected near-synonym contexts represent the general near-synonym failure regime and the order parameter captures the relevant dynamics without omitting other factors.

What would settle it

Repeating the LoRA runs with fixed embeddings on a fresh set of near-synonym contexts and checking whether the order-parameter jumps disappear.

Figures

Figures reproduced from arXiv: 2606.07559 by Jayasri Dontabhaktuni, Vaibhav Prakash.

**Figure 1.** Purity and Participation Ratio (PR) across three scenarios. Left: all probability on the correct token, purity = 1, PR = 1. Centre: uniform probability over four orthogonal tokens, purity = 0.25, PR = 4. Right: equal probability on g and a near-synonym competitor c with G_gc = 0.85, purity = 0.86, PR = 1.16 instead of 2. The geometric coupling inflates purity and artificially reduces PR. The model looks almost c…
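
The three scenarios in the Figure 1 caption can be reproduced with a short calculation. The sketch below assumes that purity has the form Σ_ij p_i p_j G_ij², with PR as its reciprocal. That form is inferred here because it matches the caption's numbers; it is not quoted from the paper, and the function names are illustrative.

```python
def purity(probs, overlaps):
    """Assumed form: sum_ij p_i * p_j * G_ij**2 (inferred, not quoted from the paper)."""
    n = len(probs)
    return sum(
        probs[i] * probs[j] * overlaps[i][j] ** 2
        for i in range(n)
        for j in range(n)
    )


def participation_ratio(probs, overlaps):
    """Reciprocal of purity: the effective number of distinguishable tokens."""
    return 1.0 / purity(probs, overlaps)


def identity(n):
    """Overlap matrix for n mutually orthogonal unit embeddings."""
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]


# Left panel: all probability on the correct token.
left = ([1.0], identity(1))
# Centre panel: uniform probability over four orthogonal tokens.
centre = ([0.25] * 4, identity(4))
# Right panel: equal probability on g and c, with overlap G_gc = 0.85.
right = ([0.5, 0.5], [[1.0, 0.85], [0.85, 1.0]])

for name, (p, G) in [("left", left), ("centre", centre), ("right", right)]:
    print(name, round(purity(p, G), 2), round(participation_ratio(p, G), 2))
```

Under this assumed form, the right-hand case gives purity ≈ 0.86 and PR ≈ 1.16, in line with the caption: the overlap term contributes 2 × 0.25 × 0.85² on top of the 0.5 that two orthogonal tokens would give.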

**Figure 2.** The discrimination criterion and burial depth. The localisation length ξ is the angular radius of the prediction cloud on the unit embedding sphere. ξ > arccos(G_max) (left, B > 1) means g and c both fall inside the cloud and the model is geometrically unresolved. Fine-tuning must physically compress the cloud to B < 1 (right) before Born scoring can resolve the competition.
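
The criterion in the Figure 2 caption reduces to comparing two angles. The sketch below takes the caption's condition ξ > arccos(G_max) directly. The ratio form B = ξ / arccos(G_max) is an assumption made here so that B > 1 coincides with that condition; the paper may define B differently. The value G_max = 0.85 is borrowed from Figure 1 for illustration, and both ξ values are hypothetical.

```python
import math


def resolution_angle(g_max):
    """Angle between g and its nearest competitor, arccos(G_max), in radians."""
    return math.acos(g_max)


def burial_ratio(xi, g_max):
    """Hypothetical B = xi / arccos(G_max); B > 1 means g and c share the cloud."""
    return xi / resolution_angle(g_max)


def is_unresolved(xi, g_max):
    """Caption's condition: the cloud radius exceeds the g-c separation angle."""
    return xi > resolution_angle(g_max)


g_max = 0.85                 # illustrative overlap, taken from Figure 1
for xi in (0.80, 0.40):      # hypothetical cloud radii in radians
    print(xi, round(burial_ratio(xi, g_max), 2), is_unresolved(xi, g_max))
```

For G_max = 0.85 the separation angle is about 0.555 rad, so in this toy setting a cloud of radius 0.80 rad is unresolved and one of radius 0.40 rad is resolved.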

**Figure 3.** Step-level Born-gap dynamics on SmolLM-360M under full FT (η = 2 × 10⁻⁵). High-G_max sentences (top, longing) show 9 quiescent steps then a single jump (ratio 33.7). Low-G_max sentences (bottom, purpose) rise continuously from step 1 (ratio 4.3). Green dashed lines mark the flip step. A learning-rate sweep (η across 5 values, all 10 sentences on SmolLM-360M, 50 combinations) is cleanly separated by a single empir…

**Figure 4.** Causal isolation (DM-5d). Freezing the embedding matrix under LoRA converts all sharp jumps into smooth trajectories and eliminates the predictive power of G_max.

**Figure 5.** Hidden-state alignment under full FT: per-step ψ_g · h (green), ψ_c · h (red), and the logit gap (dashed black). The purple line marks the Φ flip step. All three curves evolve smoothly through the flip. The discontinuity resides entirely in the softmax readout.
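
One way to see how smooth internal quantities can coexist with an abrupt readout is a two-token toy model. The sketch below is not the paper's experiment: the linear logit gap, the scale factor `beta`, and all numbers are hypothetical. It shows only that a gap crossing zero at a constant rate flips the rank in a single step, and that the per-step jump in softmax probability grows with the logit scale.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def p_correct(gaps, beta):
    """Probability of token g against competitor c when logit(g) - logit(c) = beta * gap."""
    return [softmax([beta * gap, 0.0])[0] for gap in gaps]


def largest_step_jump(values):
    """Largest change between consecutive training steps."""
    return max(abs(b - a) for a, b in zip(values, values[1:]))


# Hypothetical logit gap rising linearly through zero (never exactly zero).
gaps = [(t - 9.5) / 10 for t in range(20)]
rank_flips = sum(1 for a, b in zip(gaps, gaps[1:]) if (a < 0) != (b < 0))

print("rank flips:", rank_flips)
for beta in (1, 50):
    print(beta, round(largest_step_jump(p_correct(gaps, beta)), 3))
```

The gap itself moves by the same small amount every step, yet at the larger scale almost all of the probability shift is concentrated in the single step where the gap changes sign.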

**Figure 6.** Universality class is determined by bulk embedding geometry, not parameter count. Left: Class A (distilgpt2 82M, GPT-2-medium 345M, SmolLM-360M) with dense Gaussian bulk, G²_mean ∈ [0.045, 0.097]. Right: Class B (Pythia-70M, Pythia-410M) with sparse exponential bulk, G²_mean ∈ [0.002, 0.003]. Pythia-70M (70M) is Class B while distilgpt2 (82M) is Class A.

**Figure 7.** H_FT is consistent within our sample of five architectures (CV = 9.3%, n = 5) at η = 2 × 10⁻⁵. All five models sit at H ≈ 10, roughly ten times above their own fitted saturation threshold, despite spanning two families, two geometry classes, and a 5× parameter range. This is an empirical regularity in our sample, not a proven physical law. Validation on additional architecture families is needed.

**Figure 8.** Cross-model laws linking geometry to LoRA sufficiency. Left: B_FT correlates with bulk embedding density (Spearman ρ = +0.90, p = 0.037). Denser bulk gives deeper burial. Right: B_FT anticorrelates with the reduced field H (Spearman ρ = −0.90, p = 0.037). More headroom above the saturation threshold gives shallower burial. Bulk geometry determines both θ* and B.
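
With only five architectures, Spearman's ρ can take few distinct values, which helps in reading the statistics quoted in the Figure 8 caption. The sketch below uses hypothetical rankings (not the paper's data) to show that a single swap of adjacent ranks among n = 5 items gives ρ = 0.90, and that the usual t-approximation with n − 2 = 3 degrees of freedom then gives a two-sided p of about 0.037. Whether the authors used this approximation is an assumption.

```python
import math


def spearman_rho(rank_x, rank_y):
    """Spearman's rho from two tie-free rank lists: 1 - 6*sum(d^2) / (n*(n^2 - 1))."""
    n = len(rank_x)
    d2 = sum((a - b) ** 2 for a, b in zip(rank_x, rank_y))
    return 1.0 - 6.0 * d2 / (n * (n * n - 1))


def two_sided_p_df3(t):
    """Two-sided p-value for Student's t with 3 degrees of freedom (closed form)."""
    x = abs(t) / math.sqrt(3.0)
    cdf = 0.5 + (x / (1.0 + x * x) + math.atan(x)) / math.pi
    return 2.0 * (1.0 - cdf)


# Hypothetical rankings of five models; one adjacent pair is swapped.
rank_x = [1, 2, 3, 4, 5]
rank_y = [1, 2, 3, 5, 4]

rho = spearman_rho(rank_x, rank_y)
t_stat = rho * math.sqrt((len(rank_x) - 2) / (1.0 - rho ** 2))
p_value = two_sided_p_df3(t_stat)
print(round(rho, 2), round(p_value, 3))
```

Reversing one of the two rankings gives ρ = −0.90 with the same p, so both panels' statistics are consistent with a near-monotone ordering of five points that has one adjacent inversion.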

Original abstract

Fine-tuning a language model on contexts whose correct completion has a near-synonym competitor often fails silently. The cross-entropy loss decreases monotonically while the correct token never overtakes the competitor in rank. We study this regime across five transformer architectures spanning two families and a fivefold parameter range, on ten hand-selected near-synonym contexts. We instrument these failures with an order parameter combining the predicted distribution and pairwise embedding overlaps. It decomposes additively into a signal, tracking the model's commitment to the correct token over its nearest competitor, and a background drag, set by how the embedding bulk leaks probability into the score. This isolates two failure modes. In kinematic failure the signal stays small. In structural failure the drag actively worsens as fine-tuning proceeds. We observe sharp catapult-like jumps in the order parameter that resemble a phase transition. A central negative result organises the paper. The transitions are phantoms. The spontaneous-symmetry-breaking interpretation is ruled out by direct measurement. Catapult-like jumps still appear under LoRA fine-tuning with the token embedding matrix exactly unchanged during training, where no geometric phase transition is possible. The discontinuity lives entirely in the softmax readout. A small number of dimensionless quantities organise the trajectory across architectures. One is consistent across all five under full fine-tuning. A second sorts architectures into two classes by bulk embedding distribution and predicts LoRA sufficiency. As a blind test, the framework predicts the critical learning rate of a held-out architecture, not used to fit any parameter, to within 2.1% of a subsequent learning-rate sweep. Findings concern the near-synonym mechanism only and should not be extrapolated without recalibration.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper's core result is that apparent phase transitions during fine-tuning on near-synonym contexts persist with fixed embeddings under LoRA, so the jumps live in the softmax readout rather than embedding geometry.

The main point is that the catapult-like jumps in this order parameter are not geometric phase transitions. They still show up when LoRA keeps the token embedding matrix completely fixed, which rules out spontaneous symmetry breaking in the embeddings. The discontinuity is isolated to the softmax readout instead.

The decomposition into a signal component (how much the model favors the correct token over its competitor) and a background drag (leakage from the rest of the embedding bulk) is useful for separating kinematic failures from structural ones. Running the same setup across five architectures from two families, plus the blind prediction of critical learning rate on a held-out model to 2.1% accuracy, gives the claims some independent grounding. The dimensionless quantities that organize the trajectories across models are the most practical output.

The LoRA experiment is a clean direct test of the negative claim. The consistency checks and the fact that one quantity sorts architectures by bulk embedding distribution add weight without obvious circularity.

The main limitation is the ten hand-selected contexts. Nothing in the work shows they are representative of the broader near-synonym regime, so the dimensionless predictors may not travel as far as claimed without recalibration. The order parameter itself combines distribution and pairwise overlaps in a reasonable way, but it is possible other factors in the dynamics are left out.

This is aimed at people who study training dynamics and silent failure modes in LLM fine-tuning. The empirical scope and the successful blind test are enough to justify sending it to peer review rather than desk rejection.

Referee Report

0 major / 3 minor

Summary. The paper examines silent failures during fine-tuning of language models on near-synonym contexts, where cross-entropy loss decreases but the correct token fails to overtake its competitor in rank. Across five transformer architectures (two families, fivefold parameter range) and ten hand-selected contexts, the authors introduce an order parameter that decomposes additively into a signal component (commitment to the correct token) and background drag (embedding bulk leakage). They report catapult-like jumps in this order parameter but demonstrate via LoRA fine-tuning with the token embedding matrix held exactly fixed that these are phantom transitions occurring entirely in the softmax readout, ruling out spontaneous symmetry breaking in embedding geometry. Dimensionless quantities organize trajectories across architectures, one consistent under full fine-tuning and another sorting by bulk embedding distribution to predict LoRA sufficiency; a blind test predicts the critical learning rate of a held-out architecture to within 2.1%.

Significance. If the central negative result holds, the work supplies a mechanistic account of a common fine-tuning failure mode, isolating the discontinuity to the readout and providing non-circular support through the fixed-embedding LoRA condition, cross-architecture consistency of dimensionless quantities, and successful blind prediction. These elements (multi-architecture span, direct test of the geometric hypothesis, and out-of-sample prediction) strengthen the phantom diagnosis for the measured cases and could guide targeted interventions in near-synonym regimes.

minor comments (3)

[Abstract / Experiments] The abstract and methods should explicitly state the selection criteria or sampling procedure for the ten near-synonym contexts to support reproducibility and allow readers to assess representativeness of the failure regime.
[Order parameter section] The order parameter is described as combining predicted distribution and pairwise embedding overlaps; formalizing its exact definition (including any normalization or weighting) with an equation in the main text would improve clarity.
[Results figures] Figures reporting the order parameter trajectories and learning-rate sweeps should include explicit discussion of error bars, exclusion criteria, and statistical significance of the observed jumps.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their careful summary of the work and for the positive recommendation to accept. The report correctly identifies the central negative result (phantom transitions confined to the softmax readout) and the supporting elements (fixed-embedding LoRA controls, cross-architecture dimensionless quantities, and blind prediction). No major comments were raised that require response or revision.

Circularity Check

0 steps flagged

No significant circularity

The paper's central claims rest on direct experimental measurements (LoRA runs with token embeddings held exactly fixed, ruling out geometric phase transitions) and a blind prediction of critical learning rate on a held-out architecture (accurate to 2.1% with no fitting to that model). The order parameter, its signal/background decomposition, and the reported dimensionless quantities are constructed from observable quantities and shown consistent across architectures without reducing to the target results by definition or by self-citation chains. No load-bearing step equates a prediction to its own fitted input or imports uniqueness from prior author work.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 3 invented entities

The central claim rests on the additive decomposition of the order parameter and the assumption that the ten contexts suffice to expose the general mechanism; the dimensionless quantities are presented as organizing the data rather than being derived from first principles.

free parameters (1)

dimensionless quantities organizing trajectory
A small number of dimensionless quantities are reported to organize behavior across architectures; one is consistent under full fine-tuning and a second sorts architectures for LoRA prediction.

axioms (1)

domain assumption: The order parameter decomposes additively into signal (commitment to correct token) and background drag (embedding bulk leakage)
Introduced in the abstract as the instrument for isolating kinematic versus structural failure modes.

invented entities (3)

order parameter (no independent evidence)
purpose: Combines predicted distribution and pairwise embedding overlaps to track fine-tuning progress
Newly defined to instrument the silent failures.
signal component (no independent evidence)
purpose: Tracks model's commitment to correct token over nearest competitor
Part of the additive decomposition.
background drag component (no independent evidence)
purpose: Captures probability leakage from embedding bulk into the score
Part of the additive decomposition.

