Learning is Forgetting: LLM Training As Lossy Compression
Pith reviewed 2026-05-10 17:24 UTC · model grok-4.3
The pith
LLM pre-training produces models that compress training data near-optimally for next-sequence prediction, and this compression quality predicts downstream benchmark performance.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We argue LLMs are best seen as an instance of lossy compression, where over training they learn by retaining only information in their training data relevant to their objective(s). We show pre-training results in models that are optimally compressed for next-sequence prediction, approaching the Information Bottleneck bound on compression. Across an array of open weights models, each compresses differently, likely due to differences in the data and training recipes used. However even across different families of LLMs the optimality of a model's compression, and the information present in it, can predict downstream performance on across a wide array of benchmarks, letting us directly link representational structure to actionable insights about model performance.
What carries the argument
The Information Bottleneck bound, which gives the lowest possible rate of retained information that still permits accurate next-sequence prediction.
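For orientation, the standard Information Bottleneck objective (Tishby et al., 2000) can be written as below. Here X stands for the input context, Y for the prediction target, and Z for the model's representation. These symbol choices and the Lagrangian form are assumptions for illustration, since this summary does not state the exact formulation the paper uses.

```latex
\min_{p(z \mid x)} \; \mathcal{L}_{\mathrm{IB}} = I(X;Z) - \beta\, I(Z;Y), \qquad \beta > 0
```

Sweeping β traces out the bound: for each level of retained predictive information I(Z;Y), it gives the smallest rate I(X;Z) that achieves it.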
If this is right
- Different training recipes and data sets produce measurably different degrees of compression even among models of similar size.
- A model's closeness to the Information Bottleneck limit supplies a scalar predictor of its accuracy on many unrelated downstream tasks.
- Representational structure inside a trained network can be read out as retained information and used to forecast capabilities without running every benchmark.
- The same framing applies uniformly to any model family that optimizes a next-sequence objective.
Where Pith is reading between the lines
- If compression optimality is the operative variable, then training procedures could be modified to push models closer to the bound more quickly.
- The view recasts forgetting not as a defect but as the necessary discarding of task-irrelevant information during compression.
- Benchmark suites might be supplemented or partially replaced by direct measurements of retained information on representative data.
Load-bearing premise
That a model's distance to the compression bound can be measured without reference to the downstream benchmarks and that any observed link to performance reflects a general relationship rather than shared dependence on model scale or data overlap.
What would settle it
Compute the compression rate of two models of matched scale on the same held-out sequences; if the model closer to the bound does not reliably outperform the other on new benchmarks after controlling for scale, the claimed predictive power is refuted.
Original abstract
Despite the increasing prevalence of large language models (LLMs), we still have a limited understanding of how their representational spaces are structured. This limits our ability to interpret how and what they learn or relate them to learning in humans. We argue LLMs are best seen as an instance of lossy compression, where over training they learn by retaining only information in their training data relevant to their objective(s). We show pre-training results in models that are optimally compressed for next-sequence prediction, approaching the Information Bottleneck bound on compression. Across an array of open weights models, each compresses differently, likely due to differences in the data and training recipes used. However even across different families of LLMs the optimality of a model's compression, and the information present in it, can predict downstream performance on across a wide array of benchmarks, letting us directly link representational structure to actionable insights about model performance. In the general case the work presented here offers a unified Information-Theoretic framing for how these models learn that is deployable at scale.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that LLM pre-training constitutes lossy compression, with models approaching the Information Bottleneck bound for next-sequence prediction. It further claims that scalar measures of compression optimality and retained information extracted from the model can predict downstream benchmark performance across model families, providing a unified information-theoretic account of learning that is actionable at scale.
Significance. If the claims are substantiated with independent measurements and confound controls, the work would supply a principled way to connect representational structure directly to performance prediction, potentially reducing reliance on exhaustive benchmarking and offering an IT lens that generalizes across architectures. The cross-family scope using open-weight models is a strength that could support broader applicability if the metrics prove robust.
major comments (3)
- [Abstract] The central claim that pre-training produces models 'optimally compressed for next-sequence prediction, approaching the Information Bottleneck bound' is presented without any description of the IB formulation used, the estimators for rate or mutual information, or the procedure for quantifying 'approach' to the bound. This measurement detail is load-bearing for both the optimality assertion and the subsequent performance-prediction claim.
- [Results section (cross-family experiments)] The reported ability of compression optimality to predict downstream performance 'even across different families' lacks any indication of controls for model scale (parameter count) or data-overlap statistics. Without regression covariates, partial correlations, or ablation on size-matched subsets, the correlation is consistent with the known scale-performance relationship rather than an independent effect of representational compression.
- [Methods] The manuscript must demonstrate that the compression-optimality scalar is computed from quantities independent of both the next-token training loss and the downstream benchmark distributions. If the metric relies on perplexity, layer-wise statistics, or effective rate derived from the same training corpus, the predictive relationship risks circularity and does not establish a new causal or structural link.
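The scale control requested in the second major comment can be sketched as a first-order partial correlation. The snippet below is a minimal illustration, not the paper's analysis: the function names are hypothetical, the numbers are made-up toy values, and controlling for log parameter count alone is an assumption about what a reasonable covariate would be.

```python
import math

def pearson(a, b):
    """Pearson correlation of two equal-length sequences."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(va * vb)

def partial_corr(x, y, z):
    """Correlation of x and y after linearly removing z from both."""
    rxy, rxz, ryz = pearson(x, y), pearson(x, z), pearson(y, z)
    return (rxy - rxz * ryz) / math.sqrt((1 - rxz ** 2) * (1 - ryz ** 2))

# Hypothetical toy values, NOT results from the paper.
log_params = [math.log10(p) for p in (1e8, 4e8, 1e9, 1.7e9, 7e9)]
compression = [0.42, 0.55, 0.61, 0.70, 0.78]  # made-up optimality scores
benchmark = [0.31, 0.40, 0.47, 0.52, 0.63]    # made-up accuracies

raw = pearson(compression, benchmark)
controlled = partial_corr(compression, benchmark, log_params)
print(f"raw r = {raw:.3f}, partial r (given log params) = {controlled:.3f}")
```

If the partial correlation stays well away from zero on real data, the compression score carries signal beyond scale; if it collapses, the raw correlation was mostly a scale effect.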
minor comments (2)
- [Abstract] Abstract contains a phrasing error: 'predict downstream performance on across a wide array of benchmarks' should read 'on a wide array of benchmarks'.
- The paper would benefit from an explicit equation or pseudocode for the compression-optimality metric in the main text (or a dedicated appendix) to enable reproduction.
Simulated Author's Rebuttal
We thank the referee for their constructive comments, which have helped us identify areas to strengthen the manuscript. We address each major comment in turn and outline the revisions we will make.
- Referee: [Abstract] The central claim that pre-training produces models 'optimally compressed for next-sequence prediction, approaching the Information Bottleneck bound' is presented without any description of the IB formulation used, the estimators for rate or mutual information, or the procedure for quantifying 'approach' to the bound. This measurement detail is load-bearing for both the optimality assertion and the subsequent performance-prediction claim.
  Authors: We agree that the abstract, being a high-level summary, omits the specific technical details of the Information Bottleneck (IB) formulation. The full manuscript's Methods section details the variational IB bound for next-sequence prediction, including the estimators for the rate (using a variational approximation to the mutual information between inputs and representations) and the distortion term tied to the prediction objective, along with the procedure for measuring proximity to the bound via normalized rate-distortion curves. To improve accessibility, we will revise the abstract to briefly mention the use of variational IB estimators and the approach to the bound, while keeping it concise.
  Revision: partial
- Referee: [Results section (cross-family experiments)] The reported ability of compression optimality to predict downstream performance 'even across different families' lacks any indication of controls for model scale (parameter count) or data-overlap statistics. Without regression covariates, partial correlations, or ablation on size-matched subsets, the correlation is consistent with the known scale-performance relationship rather than an independent effect of representational compression.
  Authors: This is a fair critique regarding potential confounds. While our experiments span multiple model families with varying scales, we did not include explicit controls such as partial correlations or size-matched ablations in the reported results. We will add these analyses in the revised version, demonstrating that the compression optimality metric retains significant predictive power for benchmark performance even after controlling for parameter count. Regarding data overlap, we will include a discussion noting that cross-family open models reduce overlap risks, supported by available metadata on training data sources.
  Revision: yes
- Referee: [Methods] The manuscript must demonstrate that the compression-optimality scalar is computed from quantities independent of both the next-token training loss and the downstream benchmark distributions. If the metric relies on perplexity, layer-wise statistics, or effective rate derived from the same training corpus, the predictive relationship risks circularity and does not establish a new causal or structural link.
  Authors: We appreciate the need to clarify independence to avoid any perception of circularity. The compression-optimality scalar is computed using a separate information-theoretic probe: specifically, it derives from estimates of mutual information between the model's hidden activations (on a held-out validation set disjoint from training) and the input sequences, using a non-parametric estimator that does not rely on the training loss or perplexity. The retained information measure uses similar probes independent of downstream benchmarks. We will expand the Methods section with additional details, including pseudocode and explicit statements confirming the use of held-out data and independence from training objectives and benchmark distributions.
  Revision: yes
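The rebuttal mentions a non-parametric mutual-information estimator on held-out activations without naming it. As a rough sketch of the simplest option in that family, the snippet below computes a plug-in (histogram) estimate over discretised values. It assumes each hidden activation has already been mapped to a discrete cluster or bin ID; that preprocessing step, the function names, and the toy data are all hypothetical and are not taken from the paper.

```python
import math
from collections import Counter

def entropy(symbols):
    """Plug-in Shannon entropy in bits of a sequence of discrete symbols."""
    n = len(symbols)
    return -sum((c / n) * math.log2(c / n) for c in Counter(symbols).values())

def plugin_mutual_information(xs, zs):
    """Plug-in estimate of I(X;Z) = H(X) + H(Z) - H(X,Z), in bits."""
    return entropy(xs) + entropy(zs) - entropy(list(zip(xs, zs)))

# Hypothetical held-out sample: x is an input-context ID, z is the ID of the
# cluster (bin) that the corresponding hidden activation was assigned to.
contexts = ["a", "a", "b", "b", "c", "c", "d", "d"]
clusters = [0, 0, 0, 0, 1, 1, 1, 1]

print(plugin_mutual_information(contexts, clusters))  # prints 1.0
```

Plug-in estimates like this tend to overestimate mutual information when there are few samples per bin, so any real use would need enough held-out data relative to the number of bins, or a bias-corrected estimator.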
Circularity Check
No significant circularity; empirical claims rest on independent measurements.
The paper frames LLM pre-training as lossy compression approaching the Information Bottleneck bound for next-token prediction and reports that compression optimality and retained information correlate with downstream benchmark performance across model families. No load-bearing step reduces a 'prediction' or optimality measure to its own fitted inputs, training loss, or self-citation by construction. The optimality proxy is computed from model rate and estimated mutual information quantities that are distinct from the held-out benchmark scores; the argument is an observed statistical relationship rather than a definitional equivalence or self-referential fit. Potential scale or data-overlap confounds are external validity concerns, not circularity in the derivation.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: The Information Bottleneck bound is the appropriate theoretical limit for next-sequence prediction in LLMs
Reference graph
Works this paper leans on
- [1] Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback
  Aggarwal, C. C., Hinneburg, A., & Keim, D. A. (2001). On the surprising behavior of distance metrics in high dimensional space. International Conference on Database Theory (ICDT), 420–434.
  Allal, L. B., Lozhkov, A., Bakouch, E., Blázquez, G. M., Penedo, G., Tunstall, L., Marafioti, A., Kydlíček, H., Lajarín, A. P., Srivastav, V., Lochner, J., Fahlgren, C...
- [2] Feldman, J. (2000). Minimization of boolean complexity in human concept learning. Nature, 407(6804), 630–633.
  Feldman, J. (2016). The simplicity principle in perception and cognition. Wiley Interdisciplinary Reviews: Cognitive Science, 7(5), 330–340.
  Fourrier, C., Habib, N., Lozovskaya, A., Szafer, K., & Wolf, T. (2024). Open LLM Leaderboard v2.
  Frankle, J.,...
- [3] Gibson, E., et al. (2000). The dependency locality theory: A distance-based theory of linguistic complexity. Image, Language, Brain, 2000, 95–126.
  Goldfeld, Z., Berg, E. v. d., Greenewald, K., Melnyk, I., Nguyen, N., Kingsbury, B., & Polyanskiy, Y. (2019, May). Estimating Information Flow in Deep Neural Networks [arXiv:1810.05728 [cs, stat]]. Retrieved Apr...
- [4] Veldhoen, S., Hupkes, D., & Zuidema, W. (2016). Diagnostic classifiers: Revealing how neural networks process hierarchical structure,
- [5] Vitányi, P. M., & Li, M. (2000). Minimum description length induction, Bayesianism, and Kolmogorov complexity. IEEE Transactions on Information Theory, 46(2), 446–464.
  Voita, E., Sennrich, R., & Titov, I. (2019, September). The Bottom-up Evolution of Representations in the Transformer: A Study with Machine Translation and Language Modeling Objectives [ar...
- [6] released this year are models with 1.7B parameters or smaller that achieve competitive performance. The 1.7B Smol model was trained on 11 trillion tokens and performs comparably to the 1B OLMo2 model, which was trained on 4 trillion tokens. Broadly, the 1.7B Smol model follows a similar training trajectory to the OLMo2 1B model, having phases of expansion an...
- [7] (Bottom) On the horizontal axes is the amount of preference information in each model, based on the Tulu dataset.
  Published as a conference paper at ICLR 2026.
  Figure 12: Smol LM2 Timecourses. [Figure: Smol LM2 pre-training timecourse; 100M, 400M, 1.7B parameters]
  [Figure: Pythia pre-training timecourse; 1.4B, 6.9B parameters...]
- [8] Included are analyses of the 1.4B and 6.9B models. In terms of parametrisation these are roughly comparable to the 1B and 7B OLMo2 models analysed in the main paper. However, it’s worth noting that the methodology for training these models is substantially different, and that their performance is substantially lower than the OLMo2 models, and other more re...
- [9] E.1.1 TEMPERATURE CALIBRATION. Naively we could use the same temperature across all models; however, models differ in the dimensionality of their hidden representations. Within self-attention, as the dimensionality $d_k$ of query and key vectors grows, the variance of their dot products scales linearly with $d_k$, pushing the softmax function into saturated ...
- [10] which reduces n-gram size until the n-gram has non-zero probability in a corpus. Here though we do not interpolate different n-gram widths, instead maintaining separate aggregate estimates for each width – in part to be able to study how different levels of contextual information are represented in the model. Where a given n-gram, like a quadgram, does ...
- [11] includes quadgram estimates for reference.
  E.3 APPROXIMATING THE OUTPUT DISTRIBUTION. During inference models predict the next token given preceding context, but this is distinct from how they are trained. During training of an auto-regressive decoder-only LLM, causal masking means a token can only attend to preceding context, not trailing context. However t...
- [12] opt instead to discretise representations and compute their Shannon entropy (Shannon, 1948). There are two major reasons for this; first, differential entropy is not the true continuous analogue of Shannon entropy (Jaynes, 1957). This is shown by the fact that differential entropy $D(X)$ is unbounded, $-\infty \le D(X) \le \infty$, and changes under linear transformations. Th...
- [13] Despite this, they note the approach from Shwartz-Ziv and Tishby (2017) was not tractable to apply to the model. They opt instead for quantising representations via clustering, based on related work from Sajjadi et al. (2018). This method runs a clustering algorithm (Voita et al. (2019) use mini-batch k-means), then treats each cluster as an event in a ca...
- [14] method for computing channel capacity. Introduced in Tishby et al. (2000), the information bottleneck method for determining channel capacity relies on three equations:
  $$p_\beta(z \mid x) = \frac{p_\beta(z)}{Z_\beta(x)} \exp\left(-\beta\, D\left[p(y \mid x) \,\|\, p_\beta(y \mid z)\right]\right) \tag{12}$$
  $$p_\beta(z) = \sum_{x \in X} p(x)\, p_\beta(z \mid x) \tag{13}$$
  $$p_\beta(y \mid z) = \sum_{x \in X} p_\beta(x \mid z)\, p(y \mid x) \tag{14}$$
  These equations are satisfied self-consistently at the bound. As these ...
- [15] shows that models transition from the fitting phase, where $I(Y;Z)$ increases, to the compression phase, where $I(X;Z)$ decreases, when empirical error on the training distribution saturates. Their setting is substantively different to the one studied in our work – the most relevant differences here are that they analyse a feed-forward model trained on MNIST for mu...
- [16] and so gives us a proxy for in-distribution performance on the model’s training set. This follows a previously attested dynamic, where earlier steps dramatically decrease the loss before this begins to slowly saturate. Unlike in an MNIST setting, this objective never truly saturates, instead slowly flattening. Figure 16 shows this loss plotted against the ...
- [17] The results show the same overall pattern of expansion and compression with small changes to the exact mutual information values. Given this estimator resembles a differentiable relaxation of a binning-based estimate, it is relevant to note that in binning-based approaches increasing the number of bins reduces mutual information by assigning similar repre...
- [18] which falls under the Apache License (Version 2.0). We study a wide array of models; below is license information grouped by model family:
  • OLMo: The code and models are released under Apache 2.0.
  • Gemma: Released under the Gemma license stated here: https://ai.google.dev/gemma/terms
  • Llama: Released under the ...
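The three self-consistent equations (12) to (14) quoted from Tishby et al. (2000) in entry [14] can be iterated directly on a small discrete problem, in the style of a Blahut-Arimoto update. The sketch below is only an illustration under stated assumptions: the function names, the toy joint distribution, and the deterministic symmetry-breaking initialisation (a random initial encoder is more usual) are all hypothetical, and it says nothing about how the paper computes its bound for LLM-scale data. It also assumes every entry of p(y|x) is strictly positive so the KL terms stay finite.

```python
import math

def kl(p, q):
    """KL divergence D[p || q] in nats for discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def ib_iterate(p_x, p_y_given_x, n_z, beta, steps=200):
    """Iterate the three self-consistent IB equations; returns p(z|x)."""
    n_x, n_y = len(p_x), len(p_y_given_x[0])
    # Hypothetical deterministic initial encoder: nudge x towards cluster
    # i * n_z // n_x so the iteration does not start at the trivial solution.
    enc = []
    for i in range(n_x):
        row = [1.5 if z == i * n_z // n_x else 1.0 for z in range(n_z)]
        total = sum(row)
        enc.append([v / total for v in row])
    for _ in range(steps):
        # Eq. (13): p(z) = sum_x p(x) p(z|x)
        p_z = [sum(p_x[i] * enc[i][z] for i in range(n_x)) for z in range(n_z)]
        # Eq. (14): p(y|z) = sum_x p(x|z) p(y|x), with p(x|z) = p(z|x) p(x) / p(z)
        dec = [[sum(enc[i][z] * p_x[i] * p_y_given_x[i][y] for i in range(n_x)) / p_z[z]
                for y in range(n_y)] for z in range(n_z)]
        # Eq. (12): p(z|x) = p(z) / Z(x) * exp(-beta * D[p(y|x) || p(y|z)])
        for i in range(n_x):
            w = [p_z[z] * math.exp(-beta * kl(p_y_given_x[i], dec[z]))
                 for z in range(n_z)]
            norm = sum(w)  # the partition function Z_beta(x)
            enc[i] = [v / norm for v in w]
    return enc

# Toy problem (made up): four inputs, two of which predict y=0 and two y=1.
p_x = [0.25, 0.25, 0.25, 0.25]
p_y_given_x = [[0.9, 0.1], [0.9, 0.1], [0.1, 0.9], [0.1, 0.9]]
encoder = ib_iterate(p_x, p_y_given_x, n_z=2, beta=5.0)
for row in encoder:
    print([round(v, 3) for v in row])
```

With two clusters available, inputs that share the same predictive distribution end up assigned to the same cluster, which is the "retain only what matters for prediction" behaviour the compression framing relies on.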