Pith Number

pith:XTLBBB6Y

pith:2023:XTLBBB6Y4KTUBXU5EM27RGCOF7

not attested · not anchored · not stored · refs resolved

Language Modeling Is Compression

Anian Ruoss, Christopher Mattern, Elliot Catt, Grégoire Delétang, Joel Veness, Jordi Grau-Moya, Laurent Orseau, Li Kevin Wenliang, Marcus Hutter, Matthew Aitchison, Paul-Ambroise Duquenne, Tim Genewein

Large language models trained on text compress images and audio better than specialized tools.

arxiv:2309.10668 v2 · 2023-09-19 · cs.LG · cs.AI · cs.CL · cs.IT · math.IT

Add to your LaTeX paper

\usepackage{pith}
\pithnumber{XTLBBB6Y4KTUBXU5EM27RGCOF7}

Prints a linked badge after your title and injects PDF metadata. Compiles on arXiv.

Record completeness

1 Bitcoin timestamp

2 Internet Archive

3 Author claim open

4 Citations open

5 Replications open

✓ Portable graph bundle live

The bundle contains the canonical record plus signed events. A mirror can host it anywhere and recompute the same current state with the deterministic merge algorithm.
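The record does not describe the merge algorithm itself, so the following is only a hypothetical illustration of what "deterministic" can mean here: events are applied in a fixed total order, so any mirror holding the same set of events reaches the same state regardless of the order in which it received them. The event fields (`id`, `at`, `field`, `value`) and the last-writer-wins rule are assumptions made for this sketch, and signature checking is left out.

```python
import random
from typing import Any

# Hypothetical event shape; the real bundle schema is not shown in this record.
Event = dict[str, Any]


def merge(canonical: dict[str, Any], events: list[Event]) -> dict[str, Any]:
    """Fold events into a state in a fixed total order (timestamp, then id)."""
    state = dict(canonical)
    seen: set[str] = set()
    for event in sorted(events, key=lambda e: (e["at"], e["id"])):
        if event["id"] in seen:  # the same event may arrive from several mirrors
            continue
        seen.add(event["id"])
        state[event["field"]] = event["value"]  # last writer wins (assumption)
    return state


# Illustrative values only, not data from the actual bundle.
canonical = {"source": "arxiv:2309.10668"}
events = [
    {"id": "e1", "at": "2026-01-01T00:00:00Z", "field": "status", "value": "draft"},
    {"id": "e2", "at": "2026-01-02T00:00:00Z", "field": "status", "value": "live"},
    {"id": "e2", "at": "2026-01-02T00:00:00Z", "field": "status", "value": "live"},
]

shuffled = events[:]
random.shuffle(shuffled)
print(merge(canonical, shuffled) == merge(canonical, events))
```

Sorting on a key that is unique per event is what makes the fold independent of arrival order; a tie on the sort key between distinct events would reintroduce ambiguity.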

Claims

C1 · strongest claim

Chinchilla 70B, while trained primarily on text, compresses ImageNet patches to 43.4% and LibriSpeech samples to 16.4% of their raw size, beating domain-specific compressors like PNG (58.5%) or FLAC (30.3%), respectively.
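To put the quoted rates side by side, the short calculation below expresses each Chinchilla 70B figure relative to the domain-specific baseline. It uses only the four percentages from the claim; the helper name `relative_saving` is introduced here for illustration.

```python
# Compressed size as a percentage of raw size, as quoted in claim C1.
rates = {
    "ImageNet": {"Chinchilla 70B": 43.4, "PNG": 58.5},
    "LibriSpeech": {"Chinchilla 70B": 16.4, "FLAC": 30.3},
}


def relative_saving(model_rate: float, baseline_rate: float) -> float:
    """Percent by which the model's output is smaller than the baseline's output."""
    return 100.0 * (1.0 - model_rate / baseline_rate)


savings = {}
for dataset, row in rates.items():
    model_rate = row["Chinchilla 70B"]
    baseline, baseline_rate = next(
        (name, rate) for name, rate in row.items() if name != "Chinchilla 70B"
    )
    savings[dataset] = relative_saving(model_rate, baseline_rate)
    print(f"{dataset}: {savings[dataset]:.1f}% smaller than {baseline}")
```

On these figures the model's output is roughly a quarter smaller than PNG's and a little under half the size of FLAC's.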

C2 · weakest assumption

That the predictive distribution produced by the language model can be directly converted into a lossless compression scheme via arithmetic coding without significant overhead or implementation-specific losses that would invalidate the reported ratios.
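The link between prediction and compression that this assumption relies on can be sketched with a toy arithmetic coder. The version below is a minimal illustration, not the paper's implementation: an adaptive byte-frequency model with add-one smoothing stands in for the language model, and exact rationals (`fractions.Fraction`) replace the finite-precision arithmetic a practical coder would use. It therefore shows the idealized bound (code length within 2 bits of the model's negative log-likelihood) and says nothing about the finite-precision overhead the assumption is concerned with.

```python
from fractions import Fraction
from math import ceil, log2

ALPHABET = 256  # symbols are bytes


def encode(data: bytes) -> tuple[str, Fraction]:
    """Narrow [low, low + width) by each symbol's predicted probability."""
    low, width = Fraction(0), Fraction(1)
    counts = [0] * ALPHABET
    for i, s in enumerate(data):
        total = i + ALPHABET              # add-one smoothing over all bytes
        below = sum(counts[:s]) + s       # probability mass of symbols < s
        low += width * Fraction(below, total)
        width *= Fraction(counts[s] + 1, total)
        counts[s] += 1
    # Smallest k with 2 * 2**-k <= width guarantees a k-bit dyadic
    # interval [m / 2**k, (m + 1) / 2**k) inside the final interval.
    k = 1
    while Fraction(2, 2 ** k) > width:
        k += 1
    m = ceil(low * 2 ** k)
    return format(m, f"0{k}b"), width


def decode(bits: str, n: int) -> bytes:
    """Replay the same model and pick the symbol whose interval holds the value."""
    value = Fraction(int(bits, 2), 2 ** len(bits))
    low, width = Fraction(0), Fraction(1)
    counts = [0] * ALPHABET
    out = bytearray()
    for i in range(n):
        total = i + ALPHABET
        below = 0
        for s in range(ALPHABET):
            mass = counts[s] + 1
            if low + width * Fraction(below + mass, total) > value:
                break
            below += mass
        out.append(s)
        low += width * Fraction(below, total)
        width *= Fraction(mass, total)
        counts[s] += 1
    return bytes(out)


message = b"abracadabra abracadabra abracadabra"
bits, width = encode(message)
ideal_bits = log2(width.denominator) - log2(width.numerator)  # -log2 P(message)
print(f"raw: {8 * len(message)} bits, coded: {len(bits)} bits, ideal: {ideal_bits:.1f} bits")
print("lossless:", decode(bits, len(message)) == message)
```

The decoder is given the message length `n` out of band; a real scheme would transmit it or reserve an end-of-sequence symbol. Replacing the frequency counts with a stronger predictor shrinks `width` more slowly per symbol, which is exactly how a better model yields a shorter code.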

C3 · one line summary

Large language models serve as strong general-purpose lossless compressors for text, images, and audio, outperforming domain-specific methods and revealing insights into scaling, tokenization, and in-context learning.

References

20 extracted · 20 resolved · 8 Pith anchors

[1] On the Opportunities and Risks of Foundation Models · arXiv:2108.07258

[2] Sparks of Artificial General Intelligence: Early experiments with GPT-4 · arXiv:2303.12712

[3] Scaling Transformer to 1M Tokens and Beyond with RMT

[4] arXiv preprint arXiv:1710.09282

[5] Syntactically Informed Text Compression with Recurrent Neural Networks · arXiv:1608.02893

Cited by

18 papers in Pith

Efficient compression of neural networks and datasets

HE-SNR: Uncovering Latent Logic via Entropy for Guiding Mid-Training on SWE-bench

Scaling Synthetic Data Creation with 1,000,000,000 Personas

Are Flat Minima an Illusion?

Probabilistic Language Tries: A Unified Framework for Compression, Decision Policies, and Execution Reuse

Receipt and verification

First computed	2026-05-17T23:38:12.795179Z
Builder	pith-number-builder-2026-05-17-v1
Signature	Pith Ed25519 (`pith-v1-2026-05`) · public key
Schema	pith-number/v1.0

Canonical hash

bcd61087d8e2a740de9d2335f8984e2ffad77aecf94016c81c774f0aefaddc2e

Aliases

arxiv: 2309.10668 · arxiv_version: 2309.10668v2 · doi: 10.48550/arxiv.2309.10668 · pith_short_12: XTLBBB6Y4KTU · pith_short_16: XTLBBB6Y4KTUBXU5 · pith_short_8: XTLBBB6Y
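This page does not state how the identifier is derived. For this record, however, the 26-character number matches the first 26 characters of the RFC 4648 Base32 encoding of the canonical hash, and each `pith_short_N` alias is its first N characters. The sketch below reproduces that relation; treating it as the general rule is an assumption, and the function name is introduced here for illustration.

```python
import base64

# Canonical hash as listed in this record.
canonical_hash = "bcd61087d8e2a740de9d2335f8984e2ffad77aecf94016c81c774f0aefaddc2e"


def pith_number_from_hash(hex_digest: str, length: int = 26) -> str:
    """Base32-encode the digest bytes and keep the first `length` characters.

    Assumption: inferred from this one record, not from a published spec.
    """
    encoded = base64.b32encode(bytes.fromhex(hex_digest)).decode("ascii")
    return encoded[:length]


full = pith_number_from_hash(canonical_hash)
aliases = {f"pith_short_{n}": full[:n] for n in (8, 12, 16)}

print(full)
for name, value in aliases.items():
    print(name, value)
```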

Agent API

Resolver JSON · Graph JSON · Events JSON · Schema · Signing key

Verify this Pith Number yourself

curl -sH 'Accept: application/ld+json' https://pith.science/pith/XTLBBB6Y4KTUBXU5EM27RGCOF7 \
  | jq -c '.canonical_record' \
  | python3 -c "import sys,json,hashlib; b=json.dumps(json.loads(sys.stdin.read()), sort_keys=True, separators=(',',':'), ensure_ascii=False).encode(); print(hashlib.sha256(b).hexdigest())"
# expect: bcd61087d8e2a740de9d2335f8984e2ffad77aecf94016c81c774f0aefaddc2e

Canonical record JSON

{
  "metadata": {
    "abstract_canon_sha256": "def926c0e7ca4abc4365977dcb574ca4dabb545c6b3e14e3b6b81a9cc38c332a",
    "cross_cats_sorted": [
      "cs.AI",
      "cs.CL",
      "cs.IT",
      "math.IT"
    ],
    "license": "http://creativecommons.org/licenses/by/4.0/",
    "primary_cat": "cs.LG",
    "submitted_at": "2023-09-19T14:50:38Z",
    "title_canon_sha256": "6e120a3a5dcd5a2ea8b8e58a3af16ddbf5cf63cc0fa224a78c89c0a65669247e"
  },
  "schema_version": "1.0",
  "source": {
    "id": "2309.10668",
    "kind": "arxiv",
    "version": 2
  }
}
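The canonicalization used by the verification command (sorted keys, compact separators, UTF-8, then SHA-256) can also be run offline against the record printed above. The sketch below mirrors that one-liner; the function name `canonical_sha256` is introduced here. If the JSON shown on this page is exactly what the resolver returns under `canonical_record`, the digest should equal the canonical hash, but that has not been checked here.

```python
import hashlib
import json


def canonical_sha256(record: dict) -> str:
    """Serialize with sorted keys and no extra whitespace, then hash as UTF-8."""
    blob = json.dumps(
        record, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    ).encode("utf-8")
    return hashlib.sha256(blob).hexdigest()


# Canonical record copied from this page.
record = json.loads("""
{
  "metadata": {
    "abstract_canon_sha256": "def926c0e7ca4abc4365977dcb574ca4dabb545c6b3e14e3b6b81a9cc38c332a",
    "cross_cats_sorted": ["cs.AI", "cs.CL", "cs.IT", "math.IT"],
    "license": "http://creativecommons.org/licenses/by/4.0/",
    "primary_cat": "cs.LG",
    "submitted_at": "2023-09-19T14:50:38Z",
    "title_canon_sha256": "6e120a3a5dcd5a2ea8b8e58a3af16ddbf5cf63cc0fa224a78c89c0a65669247e"
  },
  "schema_version": "1.0",
  "source": {"id": "2309.10668", "kind": "arxiv", "version": 2}
}
""")

digest = canonical_sha256(record)
print(digest)

# Key order and whitespace in the input do not affect the digest.
reordered = {key: record[key] for key in reversed(list(record))}
print(canonical_sha256(reordered) == digest)
```

Because keys are sorted and separators fixed before hashing, two mirrors that store the same record with different formatting still arrive at the same digest. Array order is preserved, which is presumably why the cross-listed categories are stored pre-sorted.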