Universal One-third Time Scaling in Learning Peaked Distributions

Cengiz Pehlevan; Jeff Gore; Yizhou Liu; Ziming Liu

arxiv: 2602.03685 · v2 · pith:NPBXIYI3new · submitted 2026-02-03 · 💻 cs.LG · cs.AI · stat.ML

Yizhou Liu, Ziming Liu, Cengiz Pehlevan, Jeff Gore

classification 💻 cs.LG · cs.AI · stat.ML

keywords: distributions, power-law, scaling, learning, llms, loss, models, peaked

Training large language models (LLMs) is computationally expensive, partly because the loss exhibits slow power-law convergence whose origin remains debatable. Through systematic analysis of toy models and empirical evaluation of LLMs, we show that this behavior can arise intrinsically from the use of softmax and cross-entropy. When learning peaked probability distributions, e.g., next-token distributions, these components generically yield power-law vanishing losses and gradients, regardless of many microscopic details, creating a fundamental optimization bottleneck. This ultimately leads to power-law time scaling of the loss with a universal exponent of $1/3$. Our results provide a mechanistic explanation for observed neural scaling and suggest new directions for improving LLM training efficiency.
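As a minimal sketch of the kind of vanishing gradient the abstract refers to, the following pure-Python snippet evaluates the softmax cross-entropy loss and its gradient with respect to the logits for a peaked (here one-hot) target. The five-token vocabulary, the `margin` values, and the function names are illustrative assumptions, not taken from the paper, and the snippet does not reproduce the paper's toy models or the $1/3$ exponent. It only shows the standard identity that the logit gradient equals softmax(logits) minus the target, so both loss and gradient shrink as the prediction approaches the peak.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy_and_grad(logits, target):
    """Return (loss, grad), where grad = d(loss)/d(logits) = softmax(logits) - target."""
    q = softmax(logits)
    loss = -sum(p * math.log(qi) for p, qi in zip(target, q) if p > 0)
    grad = [qi - p for qi, p in zip(q, target)]
    return loss, grad


# Hypothetical peaked target: all probability mass on token 0 of a 5-token vocabulary.
target = [1.0, 0.0, 0.0, 0.0, 0.0]

results = []
for margin in [0.0, 2.0, 4.0, 8.0]:
    # The correct token's logit exceeds all others by `margin`.
    logits = [margin, 0.0, 0.0, 0.0, 0.0]
    loss, grad = cross_entropy_and_grad(logits, target)
    grad_norm = math.sqrt(sum(g * g for g in grad))
    results.append((margin, loss, grad_norm))
    print(f"margin={margin:4.1f}  loss={loss:.6f}  |grad|={grad_norm:.6f}")
```

In this sketch the loss and the gradient norm fall together as the margin grows, which is the sense in which fitting a peaked target leaves an optimizer with progressively weaker signal. How that translates into a specific time-scaling exponent depends on the analysis in the paper itself.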

Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Attractor Geometry of Transformer Memory: From Conflict Arbitration to Confident Hallucination
cs.AI 2026-05 unverdicted novelty 7.0

Transformer hidden states encode facts as attractor basins; hallucinations occur from basin absence and conflicts from basin competition, detected cleanly by geometric margin rather than entropy.
Attractor Geometry of Transformer Memory: From Conflict Arbitration to Confident Hallucination
cs.AI 2026-05 unverdicted novelty 7.0

Attractor basins in transformer hidden states unify conflict and hallucination as basin competition or absence, with geometric margin outperforming entropy for detection and a scaling law governing confident hallucina...
A Boundary-Layer Mechanism for One-Third Scaling in Online Softmax Classification
cs.LG 2026-05 unverdicted novelty 6.0

Derives $\alpha^{-1/3}$ scaling for generalization error in online softmax classification from boundary layers in a teacher-student model.