arxiv: 2303.13506 · v3 · pith:WJIAKUMXnew · submitted 2023-03-23 · 💻 cs.LG · cond-mat.dis-nn

The Quantization Model of Neural Scaling

classification 💻 cs.LG cond-mat.dis-nn
keywords: model, scaling, power, quanta, language, quantization, decompose, frequency

We propose the Quantization Model of neural scaling laws, explaining both the observed power law dropoff of loss with model and data size, and also the sudden emergence of new capabilities with scale. We derive this model from what we call the Quantization Hypothesis, where network knowledge and skills are "quantized" into discrete chunks ($\textbf{quanta}$). We show that when quanta are learned in order of decreasing use frequency, then a power law in use frequencies explains observed power law scaling of loss. We validate this prediction on toy datasets, then study how scaling curves decompose for large language models. Using language model gradients, we automatically decompose model behavior into a diverse set of skills (quanta). We tentatively find that the frequency at which these quanta are used in the training distribution roughly follows a power law corresponding with the empirical scaling exponent for language models, a prediction of our theory.
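The central claim of the abstract, that a power law in quanta use frequencies yields power-law loss scaling when quanta are learned from most to least used, can be checked numerically. The sketch below is illustrative and is not taken from the page: it assumes use frequencies of the form p_k ∝ k^-(alpha+1), assumes each unlearned quantum contributes excess loss in proportion to its use frequency, and uses hypothetical helper names (`tail_losses`, `loglog_slope`) and an arbitrary example value alpha = 0.5. Under those assumptions, the loss after learning the n most-used quanta should fall off roughly as n^-alpha.

```python
import math


def tail_losses(alpha, n_quanta=1_000_000):
    """Return a list L where L[n] is the excess loss after learning the
    n most frequently used quanta.

    Assumption (not from the source page): quantum k has use frequency
    p_k proportional to k^-(alpha + 1), and an unlearned quantum adds
    loss equal to its use frequency.
    """
    weights = [k ** -(alpha + 1.0) for k in range(1, n_quanta + 1)]
    total = sum(weights)
    losses = [0.0] * (n_quanta + 1)
    running = 0.0
    # Accumulate from the rarest quantum upward: L[n] = sum of p_k for k > n.
    for n in range(n_quanta - 1, -1, -1):
        running += weights[n] / total  # weights[n] belongs to quantum k = n + 1
        losses[n] = running
    return losses


def loglog_slope(xs, ys):
    """Least-squares slope of log(y) against log(x)."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    mx = sum(lx) / len(lx)
    my = sum(ly) / len(ly)
    num = sum((a - mx) * (b - my) for a, b in zip(lx, ly))
    den = sum((a - mx) ** 2 for a in lx)
    return num / den


if __name__ == "__main__":
    alpha = 0.5  # illustrative exponent, not a value reported on this page
    losses = tail_losses(alpha)
    ns = [10, 30, 100, 300, 1000]
    slope = loglog_slope(ns, [losses[n] for n in ns])
    print(f"fitted scaling exponent: {-slope:.3f} (frequency exponent minus one: {alpha})")
```

The fitted exponent should land close to alpha for n well below `n_quanta`. Near the truncation point the finite number of quanta bends the curve away from a pure power law, so the fit range is kept far from it.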

Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. The Long Delay to Arithmetic Generalization: When Learned Representations Outrun Behavior

    cs.LG · 2026-03 · unverdicted · novelty 7.0

    The grokking delay in encoder-decoder models on one-step Collatz prediction stems from decoder inability to use early-learned encoder representations of parity and residue structure, with numeral base acting as a stro...

  2. Correcting Influence: Unboxing LLM Outputs with Orthogonal Latent Spaces

    cs.LG · 2026-05 · unverdicted · novelty 6.0

    A latent mediation framework with sparse autoencoders enables non-additive token-level influence attribution in LLMs by learning orthogonal features and back-propagating attributions.

  3. Emergent Capabilities Arise Randomly from Learning Sparse Attention Patterns

    cs.LG · 2026-06 · unverdicted · novelty 5.0

    Emergent capabilities arise stochastically from abrupt learning of sparse attention patterns on synthetic linear map and cellular automata tasks, with larger models learning them earlier on average.