Language Modeling Is Compression
Pith reviewed 2026-05-17 22:31 UTC · model grok-4.3
The pith
Large language models trained on text compress images and audio better than specialized tools.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Predictive models convert directly into lossless compressors through arithmetic coding on their output distributions, allowing large self-supervised language models to function as powerful cross-domain compressors that surpass modality-specific baselines on image patches and speech samples.
What carries the argument
The prediction-compression equivalence realized by feeding a model's next-token predictive distribution into arithmetic coding to produce a bitstream.
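The mechanism can be illustrated with a minimal arithmetic-coding sketch. It assumes exact rational arithmetic (a practical coder would use finite-precision integers) and a hypothetical three-symbol `toy_model` standing in for a language model's next-token distribution; `toy_model`, `subinterval`, `encode`, and `decode` are illustrative names, not taken from the paper.

```python
from fractions import Fraction
from math import ceil, log2


def toy_model(context):
    """Hypothetical stand-in for a language model: P(next symbol | context)."""
    if context and context[-1] == "A":
        return {"A": Fraction(3, 4), "B": Fraction(1, 8), "C": Fraction(1, 8)}
    return {"A": Fraction(1, 2), "B": Fraction(1, 4), "C": Fraction(1, 4)}


def subinterval(low, width, probs, symbol):
    """Shrink [low, low + width) to the slice the model assigns to `symbol`."""
    cum = Fraction(0)
    for s, p in probs.items():
        if s == symbol:
            return low + width * cum, width * p
        cum += p
    raise KeyError(symbol)


def encode(symbols, model):
    """Return (bitstring, sequence probability) for a list of symbols."""
    low, width = Fraction(0), Fraction(1)
    for i, s in enumerate(symbols):
        low, width = subinterval(low, width, model(symbols[:i]), s)
    # Shortest k-bit string whose dyadic interval [m/2^k, (m+1)/2^k)
    # lies entirely inside the final interval [low, low + width).
    k = 0
    while True:
        m = ceil(low * 2**k)
        if Fraction(m + 1, 2**k) <= low + width:
            return (format(m, "b").zfill(k) if k else ""), width
        k += 1


def decode(bits, n, model):
    """Recover n symbols by replaying the same interval narrowing."""
    value = Fraction(int(bits, 2), 2 ** len(bits)) if bits else Fraction(0)
    low, width, out = Fraction(0), Fraction(1), []
    for _ in range(n):
        cum = Fraction(0)
        for s, p in model(out).items():
            if low + width * cum <= value < low + width * (cum + p):
                low, width = low + width * cum, width * p
                out.append(s)
                break
            cum += p
    return out


message = list("AABAC")
bits, prob = encode(message, toy_model)
ideal_bits = -log2(float(prob))  # theoretical codelength, -sum of log2 p_i
print(len(bits), "bits emitted; theoretical codelength", round(ideal_bits, 3))
print("round trip ok:", decode(bits, len(message), toy_model) == message)
```

The better the model predicts the sequence, the wider the final interval and the shorter the bitstring, which is the sense in which a predictor and a compressor are interchangeable. The decoder needs the same model and the sequence length to reverse the process.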
If this is right
- Scaling laws measured in language modeling also describe compression performance on non-text data.
- Tokenization choices directly affect how efficiently a language model compresses a given data type.
- In-context examples improve compression of new sequences by adapting the model's predictive distribution.
- The same equivalence lets any compressor such as gzip serve as the core of a conditional generative model.
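The last point can be sketched as follows. This is one possible reading, not the paper's implementation: it assumes Python's `zlib` module (DEFLATE, the algorithm underlying gzip) as the compressor and weights each candidate next byte by 2 raised to minus its compressed length in bits. The function names `next_byte_distribution` and `generate` are hypothetical.

```python
import random
import zlib


def next_byte_distribution(context: bytes, alphabet: bytes) -> dict:
    """Turn compressed lengths into a distribution over the next byte.

    Assumption: p(b | context) is proportional to 2 ** -(compressed bits of
    context + b). Lengths are shifted by the minimum to avoid underflow.
    """
    lengths = {b: 8 * len(zlib.compress(context + bytes([b]), 9)) for b in alphabet}
    shortest = min(lengths.values())
    weights = {b: 2.0 ** -(bits - shortest) for b, bits in lengths.items()}
    total = sum(weights.values())
    return {b: w / total for b, w in weights.items()}


def generate(context: bytes, n: int, alphabet: bytes, seed: int = 0) -> bytes:
    """Sample n bytes autoregressively from the compressor-induced distribution."""
    rng = random.Random(seed)
    out = bytearray(context)
    for _ in range(n):
        dist = next_byte_distribution(bytes(out), alphabet)
        symbols = list(dist)
        out.append(rng.choices(symbols, weights=[dist[s] for s in symbols])[0])
    return bytes(out[len(context):])


prompt = b"abababababababab"
print(next_byte_distribution(prompt, b"ab"))
print(generate(prompt, 8, b"ab"))
```

Because `zlib` reports lengths in whole bytes, many candidates tie and the resulting distribution is coarse; a compressor with finer-grained code lengths would give a sharper conditional model.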
Where Pith is reading between the lines
- Objectives that optimize compression ratio directly could replace next-token prediction for multi-modal training.
- Compression performance on held-out modalities offers a practical test of whether a model has learned general structure.
- A single model could handle text, images, and audio by learning a shared predictive distribution that serves both generation and compression.
Load-bearing premise
The model's predictive probabilities can be fed into standard arithmetic coding to realize the reported compression ratios without large practical overhead or coding losses.
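The premise can be made concrete with the textbook bound for arithmetic coding under exact arithmetic. Here $\rho$ denotes the model's probability and $\ell$ the emitted code length in bits; both symbols are introduced for illustration. The bound does not cover finite-precision implementations, which may add a little more.

```latex
\ell(x_{1:n}) \;\le\; \left\lceil -\log_2 \rho(x_{1:n}) \right\rceil + 1
\;<\; -\sum_{i=1}^{n} \log_2 \rho(x_i \mid x_{<i}) + 2
```

Under this bound the gap between the theoretical codelength and the coder's output is below two bits per encoded sequence, so whether the premise holds in practice depends mainly on sequence length and numerical precision.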
What would settle it
If arithmetic coding driven by Chinchilla 70B predictions on ImageNet patches were measured to yield ratios no better than PNG, the central performance claim would be falsified.
Original abstract
It has long been established that predictive models can be transformed into lossless compressors and vice versa. Incidentally, in recent years, the machine learning community has focused on training increasingly large and powerful self-supervised (language) models. Since these large language models exhibit impressive predictive capabilities, they are well-positioned to be strong compressors. In this work, we advocate for viewing the prediction problem through the lens of compression and evaluate the compression capabilities of large (foundation) models. We show that large language models are powerful general-purpose predictors and that the compression viewpoint provides novel insights into scaling laws, tokenization, and in-context learning. For example, Chinchilla 70B, while trained primarily on text, compresses ImageNet patches to 43.4% and LibriSpeech samples to 16.4% of their raw size, beating domain-specific compressors like PNG (58.5%) or FLAC (30.3%), respectively. Finally, we show that the prediction-compression equivalence allows us to use any compressor (like gzip) to build a conditional generative model.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that large language models are powerful general-purpose predictors and compressors due to the long-established equivalence between prediction and lossless compression. It evaluates this empirically, showing that Chinchilla 70B (primarily text-trained) achieves compression ratios of 43.4% on ImageNet patches and 16.4% on LibriSpeech samples, outperforming domain-specific baselines like PNG (58.5%) and FLAC (30.3%). The work further uses the equivalence to derive insights on scaling laws, tokenization, and in-context learning, and demonstrates constructing conditional generative models from arbitrary compressors such as gzip.
Significance. If the reported ratios reflect actual implemented compression lengths rather than purely theoretical entropy, the results would demonstrate the surprising cross-modal generality of large language models and provide a concrete, falsifiable link between scaling and compression performance. This framing could influence model evaluation practices and open avenues for using compression metrics to study in-context learning and tokenization choices.
major comments (1)
- [Experiments on cross-modal compression] The experimental section reporting the Chinchilla 70B results on ImageNet patches (43.4%) and LibriSpeech (16.4%) must clarify whether the ratios are computed as the theoretical codelength (-∑ log₂ p_i) or as the actual output length after running a concrete arithmetic coder. If the former, any finite-precision, termination, or context overhead should be quantified and shown not to alter the superiority over PNG/FLAC baselines.
minor comments (1)
- [Abstract and discussion] The abstract promises novel insights into scaling laws, tokenization, and in-context learning via the compression lens; the corresponding discussion sections would benefit from explicit pointers or short subsections that directly tie each insight back to the compression equivalence.
Simulated Author's Rebuttal
We thank the referee for their positive summary, recognition of the work's significance, and recommendation for minor revision. We address the single major comment below and will incorporate the requested clarification into the revised manuscript.
-
Referee: [Experiments on cross-modal compression] The experimental section reporting the Chinchilla 70B results on ImageNet patches (43.4%) and LibriSpeech (16.4%) must clarify whether the ratios are computed as the theoretical codelength (-∑ log₂ p_i) or as the actual output length after running a concrete arithmetic coder. If the former, any finite-precision, termination, or context overhead should be quantified and shown not to alter the superiority over PNG/FLAC baselines.
Authors: We thank the referee for highlighting this important point of clarification. The reported ratios (43.4% on ImageNet patches and 16.4% on LibriSpeech) are computed as the theoretical codelength -∑ log₂ p_i using the model's next-token (or next-patch) predictive probabilities, in direct accordance with the prediction-compression equivalence discussed in the paper. We did not run a concrete arithmetic coder for these cross-modal results, as the goal was to evaluate the LLM's raw predictive power as a general-purpose compressor. In the revised manuscript we will explicitly state this in the experimental section. Regarding overheads, standard arithmetic coding incurs only a small constant overhead (typically O(1) bits for termination plus a few bits for finite-precision renormalization and context flushing). For the sample lengths used here (hundreds to thousands of tokens/patches), this overhead is negligible (<0.1% relative to total codelength) and cannot reverse the reported advantage over PNG (58.5%) or FLAC (30.3%). We will add a short paragraph or footnote providing this bound and confirming that the superiority holds under any practical implementation.
revision: yes
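The overhead argument in the response can be checked with a short calculation. The sketch below assumes each sequence is a 2048-byte chunk and that the coder adds at most 2 bits per chunk (the exact-arithmetic bound for arithmetic coding); neither figure is stated in the response, and `with_overhead` is a hypothetical helper.

```python
RAW_CHUNK_BYTES = 2048  # assumption: raw size of one encoded sequence
OVERHEAD_BITS = 2       # assumption: per-sequence coding overhead in bits

# (theoretical LLM ratio, baseline ratio) as quoted in the report
REPORTED = {
    "ImageNet patches vs PNG": (0.434, 0.585),
    "LibriSpeech vs FLAC": (0.164, 0.303),
}


def with_overhead(ratio, raw_bytes=RAW_CHUNK_BYTES, overhead_bits=OVERHEAD_BITS):
    """Compression ratio after adding a fixed per-sequence overhead in bits."""
    raw_bits = 8 * raw_bytes
    return (ratio * raw_bits + overhead_bits) / raw_bits


results = {}
for name, (llm_ratio, baseline_ratio) in REPORTED.items():
    adjusted = with_overhead(llm_ratio)
    relative = (adjusted - llm_ratio) / llm_ratio  # overhead / codelength
    results[name] = (adjusted, relative, adjusted < baseline_ratio)
    print(f"{name}: adjusted ratio {adjusted:.4f}, "
          f"relative overhead {relative:.4%}, still better: {adjusted < baseline_ratio}")
```

Under these assumptions the relative overhead stays below the 0.1% figure quoted in the response. Shorter sequences or a coder with a larger constant overhead would change the numbers, so the calculation is only as good as the two assumed constants.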
Circularity Check
No significant circularity; empirical claims rest on direct measurement under established equivalence
The paper invokes the long-established equivalence between predictors and lossless compressors (a standard result in information theory, not derived or cited from the authors' prior work here) and then reports measured compression ratios on held-out data from ImageNet and LibriSpeech. These ratios are obtained by applying the fixed Chinchilla model to new token sequences and comparing encoded length to raw size; they are not obtained by fitting parameters to the target ratios, renaming a known pattern, or reducing via self-citation to an unverified premise. The derivation chain is therefore self-contained against external benchmarks and does not collapse to its inputs by construction.
Axiom & Free-Parameter Ledger
axioms (1)
- standard math: Any sufficiently accurate probabilistic predictor can be converted into a lossless compressor via arithmetic coding.
Forward citations
Cited by 18 Pith papers
-
Effective Context in Transformers: An Analysis of Fragmentation and Tokenization
Fragmentation strictly raises optimal finite-context log-loss on Markov sources while tokenization can make a short token window equivalent to a longer source window under reliability and compression conditions.
-
Are Flat Minima an Illusion?
Flat minima are illusory; generalization is driven by weakness, a reparameterization-invariant measure of compatible completions that predicts performance better than sharpness on MNIST and Fashion-MNIST.
-
Manifold Steering Reveals the Shared Geometry of Neural Network Representation and Behavior
Manifold steering along activation geometry induces behavioral trajectories matching the natural manifold of outputs, while linear steering produces off-manifold unnatural behaviors.
-
Sequential KV Cache Compression via Probabilistic Language Tries: Beyond the Per-Vector Shannon Limit
Sequential KV compression via probabilistic language tries and predictive delta coding achieves 3.3-4.3 bits per token entropy, yielding up to 914x better ratios than TurboQuant even with large overhead.
-
Hierarchical SVG Tokenization: Learning Compact Visual Programs for Scalable Vector Graphics Modeling
HiVG introduces hierarchical SVG tokenization with atomic and segment tokens plus HMN initialization to enable more efficient and stable autoregressive generation of vector graphics programs.
-
HE-SNR: Uncovering Latent Logic via Entropy for Guiding Mid-Training on SWE-bench
HE-SNR is a high-entropy signal-to-noise ratio metric derived from the Entropy Compression Hypothesis to better guide LLM mid-training on complex software engineering benchmarks.
-
Learn-to-learn on Arbitrary Textual Conditioning: A Hypernetwork-Driven Meta-Gated LLM
A hypernetwork generates meta-gating parameters for SwiGLU blocks to let LLMs adapt their nonlinearity to arbitrary textual conditions, outperforming finetuning and meta-learning baselines with reasonable generalizati...
-
Xiaomi OneVL: One-Step Latent Reasoning and Planning with Vision-Language Explanation
OneVL is the first latent CoT method to exceed explicit CoT accuracy on four driving benchmarks while running at answer-only speed, by supervising latent tokens with a visual world model decoder.
-
Xiaomi OneVL: One-Step Latent Reasoning and Planning with Vision-Language Explanation
OneVL achieves superior accuracy to explicit chain-of-thought reasoning at answer-only latency by supervising latent tokens with a visual world model decoder that predicts future frames.
-
Lossless Compression via Chained Lightweight Neural Predictors with Information Inheritance
A new chain of lightweight neural predictors with information inheritance achieves near state-of-the-art lossless compression ratios while delivering 1.2-6.3x faster encoding and 2.8-12.3x faster decoding than PAC on GPUs.
-
Probabilistic Language Tries: A Unified Framework for Compression, Decision Policies, and Execution Reuse
Probabilistic language tries unify compression, sequential decision making, and inference caching by making explicit the prefix structure of any generative model over sequences.
-
Scaling Synthetic Data Creation with 1,000,000,000 Personas
A curated set of one billion personas enables scalable, diverse synthetic data generation for LLM training across reasoning, instructions, knowledge, NPCs, and tools.
-
Model Capacity Determines Grokking through Competing Memorisation and Generalisation Speeds
Grokking emerges near the model size where memorization timescale T_mem(P) intersects generalization timescale T_gen(P) on modular arithmetic.
-
Efficient Learned Data Compression via Dual-Stream Feature Decoupling
A dual-stream decoupler plus hierarchical refiner and parallel pipeline yields state-of-the-art compression ratio and throughput with lowest reported latency and memory in learned data compression.
-
Efficient compression of neural networks and datasets
Refined probabilistic and smooth l0 pruning techniques approximate minimum description length for neural networks, achieving high compression with minimal accuracy loss and empirically verifying better sample efficien...
-
Hallucination of Multimodal Large Language Models: A Survey
The survey organizes causes of hallucinations in MLLMs, reviews evaluation benchmarks and metrics, and outlines mitigation approaches plus open questions.
-
A Survey on Hallucination in Large Language Models: Principles, Taxonomy, Challenges, and Open Questions
The paper surveys hallucination in LLMs with an innovative taxonomy, factors, detection methods, benchmarks, mitigation strategies, and open research directions.
-
The Rhetoric of Machine Learning
Machine learning is inherently rhetorical and is often deployed as 'manipulation as a service' in business models.