
arxiv: 2604.07569 · v1 · submitted 2026-04-08 · 💻 cs.LG · cs.AI · cs.CL · cs.IT · math.IT

Learning is Forgetting: LLM Training As Lossy Compression

Pith reviewed 2026-05-10 17:24 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · cs.CL · cs.IT · math.IT
keywords: large language models, lossy compression, information bottleneck, pre-training, next-sequence prediction, representational structure, downstream performance

The pith

LLM pre-training produces models that compress training data near-optimally for next-sequence prediction, and this compression quality predicts downstream benchmark performance.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper treats large language models as engines of lossy compression that discard information from their training data unless it helps with next-sequence prediction. Pre-training drives each model toward the theoretical minimum information needed, as bounded by the Information Bottleneck principle. Across many open-weight models the distance to this bound, together with the amount of retained information, tracks how well the model will do on a broad set of later tasks. The result supplies a single information-theoretic account that links the structure inside a trained network to measurable capabilities without requiring exhaustive benchmark runs.

Core claim

We argue LLMs are best seen as an instance of lossy compression, where over training they learn by retaining only information in their training data relevant to their objective(s). We show pre-training results in models that are optimally compressed for next-sequence prediction, approaching the Information Bottleneck bound on compression. Across an array of open weights models, each compresses differently, likely due to differences in the data and training recipes used. However even across different families of LLMs the optimality of a model's compression, and the information present in it, can predict downstream performance on across a wide array of benchmarks, letting us directly link representational structure to actionable insights about model performance.

What carries the argument

The Information Bottleneck bound, which gives the lowest possible rate of retained information that still permits accurate next-sequence prediction.
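
For orientation, the bound is usually stated through the Information Bottleneck Lagrangian of Tishby et al. (2000). The form below is the standard textbook statement, not a formula copied from the paper; X denotes the input, Y the prediction target, Z the model's representation, and β a trade-off parameter.

```latex
\min_{p(z \mid x)} \; I(X;Z) \;-\; \beta \, I(Y;Z), \qquad \beta > 0
```

Sweeping β traces the curve of the largest achievable I(Y;Z) at each level of I(X;Z). A representation on that curve retains no more input information than its predictive information requires.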

If this is right

  • Different training recipes and data sets produce measurably different degrees of compression even among models of similar size.
  • A model's closeness to the Information Bottleneck limit supplies a scalar predictor of its accuracy on many unrelated downstream tasks.
  • Representational structure inside a trained network can be read out as retained information and used to forecast capabilities without running every benchmark.
  • The same framing applies uniformly to any model family that optimizes a next-sequence objective.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If compression optimality is the operative variable, then training procedures could be modified to push models closer to the bound more quickly.
  • The view recasts forgetting not as a defect but as the necessary discarding of task-irrelevant information during compression.
  • Benchmark suites might be supplemented or partially replaced by direct measurements of retained information on representative data.

Load-bearing premise

That a model's distance to the compression bound can be measured without reference to the downstream benchmarks and that any observed link to performance reflects a general relationship rather than shared dependence on model scale or data overlap.

What would settle it

Compute the compression rate of two models of matched scale on the same held-out sequences; if the model closer to the bound does not reliably outperform the other on new benchmarks after controlling for scale, the claimed predictive power is refuted.
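
The decision rule in this paragraph could be scored roughly as follows. This is a sketch only: the sign test is an illustrative choice rather than anything the paper prescribes, and the function name and input layout are hypothetical.

```python
from math import comb

def sign_test(pairs):
    """Score matched-scale model pairs.

    pairs: list of (score_closer_to_bound, score_farther_from_bound),
    one entry per pair of scale-matched models on a new benchmark.
    Returns (wins, n, p_value), where p_value is the one-sided
    probability of at least `wins` successes in n fair coin flips.
    Ties are dropped before counting.
    """
    decided = [(a, b) for a, b in pairs if a != b]
    n = len(decided)
    wins = sum(1 for a, b in decided if a > b)
    p_value = sum(comb(n, k) for k in range(wins, n + 1)) / 2 ** n
    return wins, n, p_value
```

A large p-value over many matched pairs would count against the claimed predictive power; a small one would be consistent with it, though not proof of a causal link.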

Figures

Figures reproduced from arXiv: 2604.07569 by Henry C. Conklin, Jonathan D. Cohen, Julian Gold, Max Bartolo, Seraphina Goldfarb-Tarrant, Tan Yi-Chern, Thomas L. Griffiths, Tom Hosking.

Figure 1: LLMs Learn an Optimal Compression of the Internet. (Left) The information plane for pre-training of the OLMo2 7B model. The horizontal axis shows mutual information between representations and the input (complexity), the vertical axis shows mutual information with the predicted output (expressivity). The dotted line indicates the bound where models are optimally compressed, hue indicates timepoint in traini…
Figure 2: Illustration of Soft Entropy Estimation. (Top) These facets illustrate the normalisation, sampling, and soft assignment formalised in equation 2. (Bottom) Soft assignments are aggregated into a distribution that describes the space P(Ẑ), of which we take the Shannon entropy (equation 3.1). An interactive visual of this process is linked in the original caption.
Figure 3: Illustration of conditional probability estimates. An example sentence is provided, assuming word-level tokenization for simplicity. At left are the indices for the input and output tokens when the current input word is "wherefore". At right is shown the sub-setting procedure for estimating conditional probabilities. This illustrates how bigram estimates do not compute entropy of two token embeddings, rather…
Figure 4: Models Largely Encode Local Context. (Top) The information plane over pre-training for the different levels of backoff. By changing how many tokens we condition the mutual information on in the context window, we see how the OLMo2 7B model compresses not just token but also local context information. Across all context windows we see the same two phase pattern predicted by the Information Bottleneck – wit…
Figure 5: Models Converge Along the Bound With Smaller Models Struggling to Compress. (Top Left) Open-weights models across 6 families at the end of training lie along the bound on optimal compression. Hue indicates performance on MMLU Pro. (Top Right) The vertical axis indicates mutual information with preference, with models with more preference information exhibiting better performance. (Bottom) Zooming in on la…
Figure 6: Representation Information Relates Significantly to Performance. Vertical axes, shared across all plots, show aggregate performance across 6 benchmarks (MMLU Pro, BBH, Math LVL5, IFEval, GPQA, MuSR). Horizontal axes use token back-off to show complexity significantly correlates with downstream performance (Top Left), while expressivity alone does not (Top Right). The ratio of how many bits of expressivity …
Figure 7: Proportions of Information Vs. Performance: Across 47 models, less token information and more local contextual information relate significantly to performance, based on a Spearman correlation reported above each facet. Hue indicates model; the legend is provided in figure 6. Results here focus on aggregate performance across 6 benchmarks; in Appendix C we discuss each of the benchmarks individually. At the indiv…
Figure 8: Token & Bigram Information Plane for Open Weights Models. Shown here is the full, labelled token information plane for 75 open-weights models. While models lie at different levels of complexity and expressivity, they broadly approach the IB Bound on optimal compression. Hue indicates optimality - or proximity to the bound.
Figure 9: Trigram and Quadgram Information Plane. Shown here is the full, labelled trigram and quadgram information plane for 75 open-weights models. Compared with the token case above, here models lie even closer to the frontier. The quadgram estimates are noisy due to sample sparsity; this, combined with the fact that all models are close to the bound, results in some estimates appearing to cross the bound.
Figure 10: Post-Training and Preference Information. (Above) Preference information on the vertical axis against whether or not the model is post-trained on the horizontal axis, with significance values from a paired permutation test above. (Below) Again, preference information on the vertical axis against optimality of a model's compression on the horizontal axis.
Figure 11: Individual Task Performance Relationships: Shown on the vertical axis are individual task accuracies, with each facet representing a different task. (Top) On the horizontal axis are optimality scores on C4 across 47 different open weights models, with hue indicating model id using the same legend as …
Figure 12: Smol LM2 Timecourses
Figure 13: Pythia Model Timecourses
Figure 14: Illustration of conditional probability estimates. An example sentence is provided, assuming word-level tokenization for simplicity. At left are the indices for the input and output tokens when the current input word is "wherefore". At right is shown the sub-setting procedure for estimating conditional probabilities. This illustrates how bigram estimates do not compute entropy of two token embeddings, rathe…
Figure 15: Proportions of Information Vs. Performance: Across 47 models, less token information and more local contextual information relate significantly to performance, based on a Spearman correlation. … n-gram width n we can get a proportion ϕ of model information by normalising by the entropy of the model:

$$\phi(x, n) = \frac{I(Z;\, x_n \mid x_1, \ldots, x_{n-1})}{H(Z)} \tag{9}$$

We compute this for each level of backoff, where at the token level ϕ(x,…
Figure 16: Cross-Entropy Loss vs. Distance to bound. Shown on the vertical axis is the OLMo2 7b model's cross-entropy loss on 10,000 examples from C4. On the horizontal axis is the ratio between I(Y;Z) and I(X;Z), which is indicative of how close a representation is to the IB bound. Models begin to compress and approach the bound as the loss saturates.
Figure 17: Estimator Robustness to number of bins and data distribution. Shown are trajectories through the information plane for the OLMo2 7b model. (Top) Trajectories in the main paper use 100 reference points w_i per layer; here 50 points are used, and show the same overall two-phase pattern. (Bottom) Estimates in the main paper are with respect to C4 given its resemblance to an LLM's training distribution, here a…
Figure 18: Pre-training Time-courses Computed with a Mean vs. Expectation. Shown are trajectories through the information plane over pre-training for the OLMo2 1b, 7b, 32b models. These analyses use a mean (top) or expectation (bottom) in computation of mutual information. The expectation is used in the main paper as it reflects true mutual information. Hue indicates tokens in billions over the course of pre-traini…
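
Figure 2's caption describes soft entropy estimation as normalisation, sampling, soft assignment, aggregation, and a Shannon entropy. A minimal sketch of that sequence follows, under assumptions that are not taken from the paper: reference points are drawn uniformly on the unit sphere, the soft assignment is a softmax over cosine similarities, and `temperature=0.1` is an arbitrary placeholder. The default of 100 reference points mirrors the count quoted in Figure 17's caption; all function names are hypothetical.

```python
import math
import random

def normalise(v):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def soft_assign(z, refs, temperature):
    """Softmax over scaled dot products between z and each reference point."""
    sims = [sum(a * b for a, b in zip(z, w)) / temperature for w in refs]
    top = max(sims)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in sims]
    total = sum(exps)
    return [e / total for e in exps]

def soft_entropy(reps, n_refs=100, temperature=0.1, seed=0):
    """Shannon entropy (bits) of the soft assignments averaged over reps."""
    rng = random.Random(seed)
    dim = len(reps[0])
    # Assumption: reference points sampled uniformly on the unit sphere.
    refs = [normalise([rng.gauss(0.0, 1.0) for _ in range(dim)])
            for _ in range(n_refs)]
    agg = [0.0] * n_refs
    for z in reps:
        for i, p in enumerate(soft_assign(normalise(z), refs, temperature)):
            agg[i] += p / len(reps)
    return -sum(p * math.log2(p) for p in agg if p > 0.0)
```

The estimate is bounded above by log2(n_refs), so changing the number of reference points shifts absolute values, which is the sensitivity Figure 17 examines.
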
Original abstract

Despite the increasing prevalence of large language models (LLMs), we still have a limited understanding of how their representational spaces are structured. This limits our ability to interpret how and what they learn or relate them to learning in humans. We argue LLMs are best seen as an instance of lossy compression, where over training they learn by retaining only information in their training data relevant to their objective(s). We show pre-training results in models that are optimally compressed for next-sequence prediction, approaching the Information Bottleneck bound on compression. Across an array of open weights models, each compresses differently, likely due to differences in the data and training recipes used. However even across different families of LLMs the optimality of a model's compression, and the information present in it, can predict downstream performance on across a wide array of benchmarks, letting us directly link representational structure to actionable insights about model performance. In the general case the work presented here offers a unified Information-Theoretic framing for how these models learn that is deployable at scale.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper claims that LLM pre-training constitutes lossy compression, with models approaching the Information Bottleneck bound for next-sequence prediction. It further claims that scalar measures of compression optimality and retained information extracted from the model can predict downstream benchmark performance across model families, providing a unified information-theoretic account of learning that is actionable at scale.

Significance. If the claims are substantiated with independent measurements and confound controls, the work would supply a principled way to connect representational structure directly to performance prediction, potentially reducing reliance on exhaustive benchmarking and offering an IT lens that generalizes across architectures. The cross-family scope using open-weight models is a strength that could support broader applicability if the metrics prove robust.

major comments (3)
  1. [Abstract] The central claim that pre-training produces models 'optimally compressed for next-sequence prediction, approaching the Information Bottleneck bound' is presented without any description of the IB formulation used, the estimators for rate or mutual information, or the procedure for quantifying 'approach' to the bound. This measurement detail is load-bearing for both the optimality assertion and the subsequent performance-prediction claim.
  2. [Results section (cross-family experiments)] The reported ability of compression optimality to predict downstream performance 'even across different families' lacks any indication of controls for model scale (parameter count) or data-overlap statistics. Without regression covariates, partial correlations, or ablation on size-matched subsets, the correlation is consistent with the known scale-performance relationship rather than an independent effect of representational compression.
  3. [Methods] The manuscript must demonstrate that the compression-optimality scalar is computed from quantities independent of both the next-token training loss and the downstream benchmark distributions. If the metric relies on perplexity, layer-wise statistics, or effective rate derived from the same training corpus, the predictive relationship risks circularity and does not establish a new causal or structural link.
minor comments (2)
  1. [Abstract] The abstract contains a phrasing error: 'predict downstream performance on across a wide array of benchmarks' should read 'on a wide array of benchmarks'.
  2. The paper would benefit from an explicit equation or pseudocode for the compression-optimality metric in the main text (or a dedicated appendix) to enable reproduction.
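
As a rough illustration of the kind of pseudocode minor comment 2 asks for, the sketch below computes the ratio of I(Y;Z) to I(X;Z), which Figure 16's caption describes as indicative of closeness to the IB bound. It is not the paper's metric: it assumes all three variables are already discrete and uses a plug-in (count-based) estimator rather than the soft entropy estimator of Figure 2. Function names are hypothetical.

```python
import math
from collections import Counter

def mutual_information(pairs):
    """Plug-in mutual information (bits) from a list of (a, b) samples."""
    n = len(pairs)
    joint = Counter(pairs)
    left = Counter(a for a, _ in pairs)
    right = Counter(b for _, b in pairs)
    # p(a,b) * log2( p(a,b) / (p(a) p(b)) ), written in terms of counts
    return sum(
        (c / n) * math.log2(c * n / (left[a] * right[b]))
        for (a, b), c in joint.items()
    )

def expressivity_per_complexity(xs, ys, zs):
    """Ratio I(Y;Z) / I(X;Z) for aligned samples of input, target, code."""
    complexity = mutual_information(list(zip(xs, zs)))
    expressivity = mutual_information(list(zip(ys, zs)))
    if complexity == 0.0:
        raise ValueError("I(X;Z) is zero; the ratio is undefined")
    return expressivity / complexity
```

Plug-in estimates are biased upward on small samples, so any real use would need far more samples than distinct symbols.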

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive comments, which have helped us identify areas to strengthen the manuscript. We address each major comment in turn and outline the revisions we will make.

Point-by-point responses
  1. Referee: [Abstract] The central claim that pre-training produces models 'optimally compressed for next-sequence prediction, approaching the Information Bottleneck bound' is presented without any description of the IB formulation used, the estimators for rate or mutual information, or the procedure for quantifying 'approach' to the bound. This measurement detail is load-bearing for both the optimality assertion and the subsequent performance-prediction claim.

    Authors: We agree that the abstract, being a high-level summary, omits the specific technical details of the Information Bottleneck (IB) formulation. The full manuscript's Methods section details the variational IB bound for next-sequence prediction, including the estimators for the rate (using a variational approximation to the mutual information between inputs and representations) and the distortion term tied to the prediction objective, along with the procedure for measuring proximity to the bound via normalized rate-distortion curves. To improve accessibility, we will revise the abstract to briefly mention the use of variational IB estimators and the approach to the bound, while keeping it concise.
    Revision: partial

  2. Referee: [Results section (cross-family experiments)] The reported ability of compression optimality to predict downstream performance 'even across different families' lacks any indication of controls for model scale (parameter count) or data-overlap statistics. Without regression covariates, partial correlations, or ablation on size-matched subsets, the correlation is consistent with the known scale-performance relationship rather than an independent effect of representational compression.

    Authors: This is a fair critique regarding potential confounds. While our experiments span multiple model families with varying scales, we did not include explicit controls such as partial correlations or size-matched ablations in the reported results. We will add these analyses in the revised version, demonstrating that the compression optimality metric retains significant predictive power for benchmark performance even after controlling for parameter count. Regarding data-overlap, we will include a discussion noting that cross-family open models reduce overlap risks, supported by available metadata on training data sources.
    Revision: yes

  3. Referee: [Methods] The manuscript must demonstrate that the compression-optimality scalar is computed from quantities independent of both the next-token training loss and the downstream benchmark distributions. If the metric relies on perplexity, layer-wise statistics, or effective rate derived from the same training corpus, the predictive relationship risks circularity and does not establish a new causal or structural link.

    Authors: We appreciate the need to clarify independence to avoid any perception of circularity. The compression-optimality scalar is computed using a separate information-theoretic probe: specifically, it derives from estimates of mutual information between the model's hidden activations (on a held-out validation set disjoint from training) and the input sequences, using a non-parametric estimator that does not rely on the training loss or perplexity. The retained information measure uses similar probes independent of downstream benchmarks. We will expand the Methods section with additional details, including pseudocode and explicit statements confirming the use of held-out data and independence from training objectives and benchmark distributions.
    Revision: yes
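
The scale control discussed in response 2 could take the form of a first-order partial Spearman correlation, sketched below. This is one possible analysis, not the authors' stated procedure; the helper names are hypothetical, and in practice `z` would be something like log parameter count.

```python
import math

def ranks(values):
    """Average ranks (1-based); tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            out[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return out

def pearson(a, b):
    """Pearson correlation of two equal-length sequences."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    var_a = sum((x - ma) ** 2 for x in a)
    var_b = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)

def partial_spearman(x, y, z):
    """Spearman correlation of x and y with z partialled out."""
    rx, ry, rz = ranks(x), ranks(y), ranks(z)
    r_xy, r_xz, r_yz = pearson(rx, ry), pearson(rx, rz), pearson(ry, rz)
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))
```

Here `x` would hold per-model optimality scores and `y` benchmark scores. The formula is undefined when either variable is perfectly rank-correlated with `z`, which is itself a sign that scale cannot be separated from the metric in that sample.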

Circularity Check

0 steps flagged

No significant circularity; empirical claims rest on independent measurements.

Full rationale

The paper frames LLM pre-training as lossy compression approaching the Information Bottleneck bound for next-token prediction and reports that compression optimality and retained information correlate with downstream benchmark performance across model families. No load-bearing step reduces a 'prediction' or optimality measure to its own fitted inputs, training loss, or self-citation by construction. The optimality proxy is computed from model rate and estimated mutual information quantities that are distinct from the held-out benchmark scores; the argument is an observed statistical relationship rather than a definitional equivalence or self-referential fit. Potential scale or data-overlap confounds are external validity concerns, not circularity in the derivation.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the applicability of the information bottleneck to next-token prediction and on the assumption that compression can be quantified in a manner that is both optimal and predictive of unrelated benchmarks. No free parameters or invented entities are mentioned in the abstract.

axioms (1)
  • Domain assumption: the Information Bottleneck bound is the appropriate theoretical limit for next-sequence prediction in LLMs.
    Invoked to assert that models approach optimality during pre-training.

pith-pipeline@v0.9.0 · 5514 in / 1184 out tokens · 50291 ms · 2026-05-10T17:24:30.933587+00:00 · methodology

