Grokking as Compression: A Nonlinear Complexity Perspective

Max Tegmark; Ziming Liu; Ziqian Zhong

arxiv: 2310.05918 · v1 · pith:3V7SYW4Gnew · submitted 2023-10-09 · 💻 cs.LG · cs.AI · stat.ML

Ziming Liu, Ziqian Zhong, Max Tegmark

keywords complexity · compression · linear · network · generalization · grokking · neural · number

We attribute grokking, the phenomenon where generalization is much delayed after memorization, to compression. To do so, we define the linear mapping number (LMN) to measure network complexity; it generalizes the linear region number of ReLU networks. LMN characterizes neural network compression before generalization well. Although the $L_2$ norm has been a popular choice for characterizing model complexity, we argue in favor of LMN for three reasons: (1) LMN can be naturally interpreted as information/computation, while $L_2$ cannot. (2) In the compression phase, LMN is linearly related to test loss, while $L_2$ is correlated with test loss in a complicated nonlinear way. (3) LMN reveals an intriguing phenomenon in which the XOR network switches between two generalization solutions, while $L_2$ does not. Besides explaining grokking, we argue that LMN is a promising candidate for the neural network version of the Kolmogorov complexity, since it explicitly considers local or conditioned linear computations, in line with the nature of modern artificial neural networks.
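
The abstract describes LMN as a generalization of the linear region number for ReLU networks, but does not give its definition. As a rough illustration of the simpler quantity only, the sketch below counts the distinct ReLU activation patterns that a set of sample inputs falls into; within one pattern a ReLU network acts as a single affine map, so each distinct pattern corresponds to one linear region hit by the data. This is not the paper's LMN. The function names, the toy network shape, and the random data are all hypothetical and chosen for illustration.

```python
import numpy as np

def activation_patterns(X, weights, biases):
    """Return the on/off state of every hidden ReLU unit for each input row."""
    h = X
    patterns = []
    for W, b in zip(weights, biases):
        pre = h @ W + b              # pre-activations of this layer
        patterns.append(pre > 0)     # which units are active
        h = np.maximum(pre, 0.0)     # ReLU output fed to the next layer
    return np.concatenate(patterns, axis=1)

def count_linear_regions(X, weights, biases):
    """Count distinct activation patterns reached by the samples in X.

    This only counts regions that contain at least one sample, so it is a
    lower bound on the total number of linear regions of the network.
    """
    pats = activation_patterns(X, weights, biases)
    return len({p.tobytes() for p in pats})

# Hypothetical toy network (assumption): 2 inputs -> 8 hidden -> 8 hidden.
rng = np.random.default_rng(0)
weights = [rng.normal(size=(2, 8)), rng.normal(size=(8, 8))]
biases = [rng.normal(size=8), rng.normal(size=8)]

# Hypothetical data (assumption): 1000 points drawn uniformly from [-1, 1]^2.
X = rng.uniform(-1.0, 1.0, size=(1000, 2))
n_regions = count_linear_regions(X, weights, biases)
print("distinct linear regions hit by the samples:", n_regions)
```

Tracking such a count over training steps is one simple way to watch a sample-based complexity proxy change, under the assumptions stated above.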

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Truth as a Compression Artifact in Language Model Training
cs.CL 2026-03 unverdicted novelty 6.0

Controlled experiments show language models extract correct answers from contradictory data only when errors are structurally incoherent, supporting the hypothesis that gradient descent selects the most compressible a...
Phase Transitions in Driven Informational Systems: A Two-Field Perspective on Learning Theory and Non-Equilibrium Chemistry
cs.LG 2026-05 unverdicted novelty 5.0

Proposes a two-gradient-field model with candidate order parameters alpha_dagger and kappa_c to unify phase transitions across learning theory and non-equilibrium chemistry.