
arxiv: 2602.03685 · v2 · pith:NPBXIYI3new · submitted 2026-02-03 · 💻 cs.LG · cs.AI · stat.ML

Universal One-third Time Scaling in Learning Peaked Distributions

classification 💻 cs.LG · cs.AI · stat.ML
keywords distributions · power-law · scaling · learning · llms · loss · models · peaked

Training large language models (LLMs) is computationally expensive, partly because the loss exhibits slow power-law convergence whose origin remains debatable. Through systematic analysis of toy models and empirical evaluation of LLMs, we show that this behavior can arise intrinsically from the use of softmax and cross-entropy. When learning peaked probability distributions, e.g., next-token distributions, these components generically yield power-law vanishing losses and gradients, regardless of many microscopic details, creating a fundamental optimization bottleneck. This ultimately leads to power-law time scaling of the loss with a universal exponent of $1/3$. Our results provide a mechanistic explanation for observed neural scaling and suggest new directions for improving LLM training efficiency.
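
The scaling claimed in the abstract can be read as $L(t) \propto t^{-1/3}$, where $t$ is training time. As a minimal sketch (not taken from the paper), the snippet below shows one way to estimate such an exponent from a logged loss curve using a least-squares fit in log-log space. The function name `fit_power_law_exponent`, the prefactor, and the synthetic curve are illustrative assumptions, not the authors' code or data.

```python
import math

def fit_power_law_exponent(steps, losses):
    """Estimate alpha in loss ~ C * step**(-alpha) via a log-log least-squares fit.

    Hypothetical helper for illustration; assumes all steps and losses are positive.
    """
    xs = [math.log(t) for t in steps]
    ys = [math.log(l) for l in losses]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope of the best-fit line through (log step, log loss).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    return -cov / var

# Synthetic loss curve that follows the one-third law exactly (assumed data).
steps = [10 ** k for k in range(1, 7)]
losses = [2.0 * t ** (-1.0 / 3.0) for t in steps]

alpha = fit_power_law_exponent(steps, losses)
print(f"estimated exponent: {alpha:.4f}")
```

On real training logs, a nonzero irreducible loss would likely need to be subtracted before taking logarithms; otherwise the fitted slope tends to underestimate the exponent.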


Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Attractor Geometry of Transformer Memory: From Conflict Arbitration to Confident Hallucination

    cs.AI 2026-05 unverdicted novelty 7.0

    Transformer hidden states encode facts as attractor basins; hallucinations occur from basin absence and conflicts from basin competition, detected cleanly by geometric margin rather than entropy.

  2. Attractor Geometry of Transformer Memory: From Conflict Arbitration to Confident Hallucination

    cs.AI 2026-05 unverdicted novelty 7.0

    Attractor basins in transformer hidden states unify conflict and hallucination as basin competition or absence, with geometric margin outperforming entropy for detection and a scaling law governing confident hallucina...

  3. A Boundary-Layer Mechanism for One-Third Scaling in Online Softmax Classification

    cs.LG 2026-05 unverdicted novelty 6.0

    Derives α^{-1/3} scaling for generalization error in online softmax classification from boundary layers in a teacher-student model.