
arxiv: 2309.10668 · v2 · pith:XTLBBB6Y · submitted 2023-09-19 · cs.LG · cs.AI · cs.CL · cs.IT · math.IT

Language Modeling Is Compression

Pith reviewed 2026-05-17 22:31 UTC · model grok-4.3

classification: cs.LG · cs.AI · cs.CL · cs.IT · math.IT
keywords: language modeling · lossless compression · arithmetic coding · scaling laws · in-context learning · foundation models · cross-modal prediction · predictive modeling

The pith

Large language models trained on text compress images and audio better than specialized tools.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Because any good predictor can be turned into a lossless compressor and vice versa, large language models are positioned to act as strong general-purpose compressors. Chinchilla 70B reduces ImageNet patches to 43.4 percent of raw size and LibriSpeech samples to 16.4 percent, outperforming PNG at 58.5 percent and FLAC at 30.3 percent. The same equivalence supplies fresh explanations for scaling laws, the role of tokenization, and how in-context learning works. Any existing compressor can also be run backward to produce conditional generative models.

Core claim

Predictive models convert directly into lossless compressors through arithmetic coding on their output distributions, allowing large self-supervised language models to function as powerful cross-domain compressors that surpass modality-specific baselines on image patches and speech samples.

What carries the argument

The prediction-compression equivalence realized by feeding a model's next-token predictive distribution into arithmetic coding to produce a bitstream.
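
The mechanism described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the paper's implementation: `predict` is a hypothetical stand-in for a model's next-token distribution (here a fixed three-symbol distribution that ignores its context), and exact `Fraction` arithmetic replaces the finite-precision coder a practical system would need.

```python
from fractions import Fraction
from math import ceil, log2


def predict(prefix):
    """Hypothetical toy predictor: a fixed distribution over three symbols.

    A language model would return a different distribution for each prefix.
    """
    return {"a": Fraction(1, 2), "b": Fraction(1, 4), "c": Fraction(1, 4)}


def narrow(low, width, dist, sym):
    """Shrink [low, low + width) to the sub-interval assigned to `sym`."""
    cum = Fraction(0)
    for s, p in dist.items():
        if s == sym:
            return low + cum * width, width * p
        cum += p
    raise ValueError(f"symbol {sym!r} not in the predicted distribution")


def encode(message):
    """Return a bit string naming a dyadic interval inside the message interval."""
    low, width = Fraction(0), Fraction(1)
    for i, sym in enumerate(message):
        low, width = narrow(low, width, predict(message[:i]), sym)
    n = 0
    while True:
        n += 1
        k = ceil(low * 2**n)  # smallest k with k / 2^n >= low
        if Fraction(k + 1, 2**n) <= low + width:
            return format(k, f"0{n}b")


def decode(bits, length):
    """Invert `encode`, given the bit string and the number of symbols."""
    value = Fraction(int(bits, 2), 2 ** len(bits))
    low, width = Fraction(0), Fraction(1)
    out = []
    for _ in range(length):
        dist = predict("".join(out))
        for sym in dist:
            new_low, new_width = narrow(low, width, dist, sym)
            if new_low <= value < new_low + new_width:
                out.append(sym)
                low, width = new_low, new_width
                break
    return "".join(out)


def ideal_bits(message):
    """Theoretical codelength: minus the sum of log2 of the predicted probabilities."""
    return -sum(log2(predict(message[:i])[s]) for i, s in enumerate(message))
```

For the message `"abac"` the toy predictor assigns probability 1/64, so the theoretical codelength is 6 bits, and the emitted bit string is guaranteed to be at most 7 bits long by the standard bound for infinite-precision arithmetic coding. A better predictor assigns higher probability to the observed symbols, which leaves a wider final interval and therefore a shorter code.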

If this is right

  • Scaling laws measured in language modeling also describe compression performance on non-text data.
  • Tokenization choices directly affect how efficiently a language model compresses a given data type.
  • In-context examples improve compression of new sequences by adapting the model's predictive distribution.
  • The same equivalence lets any compressor such as gzip serve as the core of a conditional generative model.
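
The last point can be illustrated with a rough sketch, assuming a simple recipe: score each candidate next byte by the gzip-compressed length of the context plus that byte, treat a codelength of ℓ bits as an unnormalized probability 2^(-ℓ), and normalize. The function names and the greedy decoding rule are illustrative choices, not the paper's code, and because gzip output lengths come in whole bytes the resulting distribution is very coarse.

```python
import gzip


def next_byte_distribution(context: bytes, alphabet: bytes) -> dict:
    """Map each candidate byte to a probability derived from gzip codelengths."""
    # Codelength in bits of the context extended by each candidate byte.
    lengths = {b: 8 * len(gzip.compress(context + bytes([b]))) for b in alphabet}
    shortest = min(lengths.values())
    # 2^(-codelength), shifted by the shortest length to avoid underflow.
    weights = {b: 2.0 ** -(bits - shortest) for b, bits in lengths.items()}
    total = sum(weights.values())
    return {b: w / total for b, w in weights.items()}


def generate(context: bytes, steps: int, alphabet: bytes) -> bytes:
    """Extend `context` greedily, one byte at a time."""
    out = context
    for _ in range(steps):
        dist = next_byte_distribution(out, alphabet)
        out += bytes([max(dist, key=dist.get)])  # most probable byte; ties go to the first
    return out
```

A call such as `generate(b"abababab", 5, b"ab")` returns the context followed by five bytes drawn from the alphabet. Sampling from the distribution instead of taking the maximum would give a stochastic generator.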

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Objectives that optimize compression ratio directly could replace next-token prediction for multi-modal training.
  • Compression performance on held-out modalities offers a practical test of whether a model has learned general structure.
  • A single model could handle text, images, and audio by learning a shared predictive distribution that serves both generation and compression.

Load-bearing premise

The model's predictive probabilities can be fed into standard arithmetic coding to realize the reported compression ratios without large practical overhead or coding losses.

What would settle it

If arithmetic coding driven by Chinchilla 70B predictions on ImageNet patches yielded ratios no better than PNG's, the central performance claim would be falsified.

Original abstract

It has long been established that predictive models can be transformed into lossless compressors and vice versa. Incidentally, in recent years, the machine learning community has focused on training increasingly large and powerful self-supervised (language) models. Since these large language models exhibit impressive predictive capabilities, they are well-positioned to be strong compressors. In this work, we advocate for viewing the prediction problem through the lens of compression and evaluate the compression capabilities of large (foundation) models. We show that large language models are powerful general-purpose predictors and that the compression viewpoint provides novel insights into scaling laws, tokenization, and in-context learning. For example, Chinchilla 70B, while trained primarily on text, compresses ImageNet patches to 43.4% and LibriSpeech samples to 16.4% of their raw size, beating domain-specific compressors like PNG (58.5%) or FLAC (30.3%), respectively. Finally, we show that the prediction-compression equivalence allows us to use any compressor (like gzip) to build a conditional generative model.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The paper claims that large language models are powerful general-purpose predictors and compressors due to the long-established equivalence between prediction and lossless compression. It evaluates this empirically, showing that Chinchilla 70B (primarily text-trained) achieves compression ratios of 43.4% on ImageNet patches and 16.4% on LibriSpeech samples, outperforming domain-specific baselines like PNG (58.5%) and FLAC (30.3%). The work further uses the equivalence to derive insights on scaling laws, tokenization, and in-context learning, and demonstrates constructing conditional generative models from arbitrary compressors such as gzip.

Significance. If the reported ratios reflect actual implemented compression lengths rather than purely theoretical entropy, the results would demonstrate the surprising cross-modal generality of large language models and provide a concrete, falsifiable link between scaling and compression performance. This framing could influence model evaluation practices and open avenues for using compression metrics to study in-context learning and tokenization choices.

major comments (1)
  1. [Experiments on cross-modal compression] The experimental section reporting the Chinchilla 70B results on ImageNet patches (43.4%) and LibriSpeech (16.4%) must clarify whether the ratios are computed as the theoretical codelength (-∑ log₂ p_i) or as the actual output length after running a concrete arithmetic coder. If the former, any finite-precision, termination, or context overhead should be quantified and shown not to alter the superiority over PNG/FLAC baselines.
minor comments (1)
  1. [Abstract and discussion] The abstract promises novel insights into scaling laws, tokenization, and in-context learning via the compression lens; the corresponding discussion sections would benefit from explicit pointers or short subsections that directly tie each insight back to the compression equivalence.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their positive summary, recognition of the work's significance, and recommendation for minor revision. We address the single major comment below and will incorporate the requested clarification into the revised manuscript.

Point-by-point responses
  1. Referee: [Experiments on cross-modal compression] The experimental section reporting the Chinchilla 70B results on ImageNet patches (43.4%) and LibriSpeech (16.4%) must clarify whether the ratios are computed as the theoretical codelength (-∑ log₂ p_i) or as the actual output length after running a concrete arithmetic coder. If the former, any finite-precision, termination, or context overhead should be quantified and shown not to alter the superiority over PNG/FLAC baselines.

    Authors: We thank the referee for highlighting this important point of clarification. The reported ratios (43.4% on ImageNet patches and 16.4% on LibriSpeech) are computed as the theoretical codelength -∑ log₂ p_i using the model's next-token (or next-patch) predictive probabilities, in direct accordance with the prediction-compression equivalence discussed in the paper. We did not run a concrete arithmetic coder for these cross-modal results, as the goal was to evaluate the LLM's raw predictive power as a general-purpose compressor. In the revised manuscript we will explicitly state this in the experimental section. Regarding overheads, standard arithmetic coding incurs only a small constant overhead (typically O(1) bits for termination plus a few bits for finite-precision renormalization and context flushing). For the sample lengths used here (hundreds to thousands of tokens/patches), this overhead is negligible (<0.1% relative to total codelength) and cannot reverse the reported advantage over PNG (58.5%) or FLAC (30.3%). We will add a short paragraph or footnote providing this bound and confirming that the superiority holds under any practical implementation. revision: yes
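
The size of the overhead discussed in this exchange can be checked with a short worked calculation. It rests on the standard bound for infinite-precision arithmetic coding and on an illustrative raw chunk size of 2048 bytes (16384 bits), which is an assumption made here for the arithmetic. Finite-precision renormalization costs are not covered by this bound and would need to be measured separately.

```latex
% Codelength of an ideal (infinite-precision) arithmetic coder
\ell(x_{1:n}) \;\le\; \Bigl\lceil -\sum_{i=1}^{n} \log_2 p_i \Bigr\rceil + 1
\;<\; -\sum_{i=1}^{n} \log_2 p_i + 2 \quad \text{bits}

% Illustrative chunk of 2048 bytes = 16384 bits (assumed size)
\text{ImageNet: } 0.434 \times 16384 \approx 7110.7 \text{ bits}, \qquad
\frac{2}{7110.7} \approx 0.028\%

\text{LibriSpeech: } 0.164 \times 16384 \approx 2687.0 \text{ bits}, \qquad
\frac{2}{2687.0} \approx 0.074\%
```

Under these assumptions the termination overhead is below 0.1% of the codelength in both cases, far smaller than the gaps to PNG (58.5%) and FLAC (30.3%).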

Circularity Check

0 steps flagged

No significant circularity; empirical claims rest on direct measurement under established equivalence

Full rationale

The paper invokes the long-established equivalence between predictors and lossless compressors (a standard result in information theory, not derived or cited from the authors' prior work here) and then reports measured compression ratios on held-out data from ImageNet and LibriSpeech. These ratios are obtained by applying the fixed Chinchilla model to new token sequences and comparing encoded length to raw size; they are not obtained by fitting parameters to the target ratios, renaming a known pattern, or reducing via self-citation to an unverified premise. The derivation chain is therefore self-contained against external benchmarks and does not collapse to its inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The work rests on the established mathematical equivalence between optimal prediction and lossless compression; no new free parameters or invented entities are introduced beyond the pretrained model weights.

axioms (1)
  • [standard math] Any probabilistic predictor can be converted into a lossless compressor via arithmetic coding; the more accurate its predictions, the shorter the resulting code.
    Invoked in the opening paragraph as the foundational link between language modeling and compression.

pith-pipeline@v0.9.0 · 5535 in / 1271 out tokens · 50992 ms · 2026-05-17T22:31:39.076349+00:00 · methodology


Forward citations

Cited by 18 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Effective Context in Transformers: An Analysis of Fragmentation and Tokenization

    cs.LG 2026-05 unverdicted novelty 8.0

    Fragmentation strictly raises optimal finite-context log-loss on Markov sources while tokenization can make a short token window equivalent to a longer source window under reliability and compression conditions.

  2. Are Flat Minima an Illusion?

    cs.LG 2026-03 unverdicted novelty 8.0

    Flat minima are illusory; generalization is driven by weakness, a reparameterization-invariant measure of compatible completions that predicts performance better than sharpness on MNIST and Fashion-MNIST.

  3. Manifold Steering Reveals the Shared Geometry of Neural Network Representation and Behavior

    cs.LG 2026-05 unverdicted novelty 7.0

    Manifold steering along activation geometry induces behavioral trajectories matching the natural manifold of outputs, while linear steering produces off-manifold unnatural behaviors.

  4. Sequential KV Cache Compression via Probabilistic Language Tries: Beyond the Per-Vector Shannon Limit

    cs.LG 2026-04 unverdicted novelty 7.0

    Sequential KV compression via probabilistic language tries and predictive delta coding achieves 3.3-4.3 bits per token entropy, yielding up to 914x better ratios than TurboQuant even with large overhead.

  5. Hierarchical SVG Tokenization: Learning Compact Visual Programs for Scalable Vector Graphics Modeling

    cs.LG 2026-04 unverdicted novelty 7.0

    HiVG introduces hierarchical SVG tokenization with atomic and segment tokens plus HMN initialization to enable more efficient and stable autoregressive generation of vector graphics programs.

  6. HE-SNR: Uncovering Latent Logic via Entropy for Guiding Mid-Training on SWE-bench

    cs.LG 2026-01 unverdicted novelty 7.0

    HE-SNR is a high-entropy signal-to-noise ratio metric derived from the Entropy Compression Hypothesis to better guide LLM mid-training on complex software engineering benchmarks.

  7. Learn-to-learn on Arbitrary Textual Conditioning: A Hypernetwork-Driven Meta-Gated LLM

    cs.CL 2026-05 unverdicted novelty 6.0

    A hypernetwork generates meta-gating parameters for SwiGLU blocks to let LLMs adapt their nonlinearity to arbitrary textual conditions, outperforming finetuning and meta-learning baselines with reasonable generalizati...

  8. Xiaomi OneVL: One-Step Latent Reasoning and Planning with Vision-Language Explanation

    cs.CV 2026-04 unverdicted novelty 6.0

    OneVL is the first latent CoT method to exceed explicit CoT accuracy on four driving benchmarks while running at answer-only speed, by supervising latent tokens with a visual world model decoder.

  9. Xiaomi OneVL: One-Step Latent Reasoning and Planning with Vision-Language Explanation

    cs.CV 2026-04 unverdicted novelty 6.0

    OneVL achieves superior accuracy to explicit chain-of-thought reasoning at answer-only latency by supervising latent tokens with a visual world model decoder that predicts future frames.

  10. Lossless Compression via Chained Lightweight Neural Predictors with Information Inheritance

    cs.IT 2026-04 unverdicted novelty 6.0

    A new chain of lightweight neural predictors with information inheritance achieves near state-of-the-art lossless compression ratios while delivering 1.2-6.3x faster encoding and 2.8-12.3x faster decoding than PAC on GPUs.

  11. Probabilistic Language Tries: A Unified Framework for Compression, Decision Policies, and Execution Reuse

    cs.LG 2026-03 unverdicted novelty 6.0

    Probabilistic language tries unify compression, sequential decision making, and inference caching by making explicit the prefix structure of any generative model over sequences.

  12. Scaling Synthetic Data Creation with 1,000,000,000 Personas

    cs.CL 2024-06 unverdicted novelty 6.0

    A curated set of one billion personas enables scalable, diverse synthetic data generation for LLM training across reasoning, instructions, knowledge, NPCs, and tools.

  13. Model Capacity Determines Grokking through Competing Memorisation and Generalisation Speeds

    cs.LG 2026-05 unverdicted novelty 5.0

    Grokking emerges near the model size where memorization timescale T_mem(P) intersects generalization timescale T_gen(P) on modular arithmetic.

  14. Efficient Learned Data Compression via Dual-Stream Feature Decoupling

    cs.CL 2026-04 unverdicted novelty 5.0

    A dual-stream decoupler plus hierarchical refiner and parallel pipeline yields state-of-the-art compression ratio and throughput with lowest reported latency and memory in learned data compression.

  15. Efficient compression of neural networks and datasets

    cs.LG 2025-05 unverdicted novelty 5.0

    Refined probabilistic and smooth l0 pruning techniques approximate minimum description length for neural networks, achieving high compression with minimal accuracy loss and empirically verifying better sample efficien...

  16. Hallucination of Multimodal Large Language Models: A Survey

    cs.CV 2024-04 accept novelty 5.0

    The survey organizes causes of hallucinations in MLLMs, reviews evaluation benchmarks and metrics, and outlines mitigation approaches plus open questions.

  17. A Survey on Hallucination in Large Language Models: Principles, Taxonomy, Challenges, and Open Questions

    cs.CL 2023-11 unverdicted novelty 5.0

    The paper surveys hallucination in LLMs with an innovative taxonomy, factors, detection methods, benchmarks, mitigation strategies, and open research directions.

  18. The Rhetoric of Machine Learning

    cs.LG 2026-04 unverdicted novelty 4.0

    Machine learning is inherently rhetorical and is often deployed as 'manipulation as a service' in business models.
