Pith Number

pith:YK5CILY5

pith:2026:YK5CILY5R3OBKUTLDZRQ6XEY2S
not attested · not anchored · not stored · refs pending

Compute Optimal Tokenization

Alisa Liu, Artidoro Pagnoni, Gargi Ghosh, Luke Zettlemoyer, Margaret Li, Mike Lewis, Sachin Mehta, Srini Iyer, Tomasz Limisiewicz

In compute-optimal regimes, language model parameter counts scale with the byte volume of data rather than the number of tokens.

arxiv:2605.01188 v2 · 2026-05-02 · cs.CL

Add to your LaTeX paper
\usepackage{pith}
\pithnumber{YK5CILY5R3OBKUTLDZRQ6XEY2S}

Prints a linked badge after your title and injects PDF metadata. Compiles on arXiv.

Record completeness

1 Bitcoin timestamp
2 Internet Archive
3 Author claim open
4 Citations open
5 Replications open
Portable graph bundle live
The bundle contains the canonical record plus signed events. A mirror can host it anywhere and recompute the same current state with the deterministic merge algorithm.

Claims

C1 strongest claim

In compute-optimal configurations, model parameter counts scale proportionally to data size measured in bytes, not in tokens as commonly perceived (Kaplan et al., 2020; Hoffmann et al., 2022).

C2 weakest assumption

That the behavior of latent-tokenized BLT models generalizes to standard subword tokenizers, and that the observed scaling trends extend beyond the tested range of up to 7B parameters.

C3 one-line summary

Compute-optimal language models require parameter count to scale with data bytes rather than tokens, with optimal token compression rate decreasing as compute budget grows.
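To make the distinction between bytes and tokens concrete, the sketch below converts a fixed byte volume into token counts for two hypothetical tokenizers. The corpus size, the tokenizer names and the compression rates (bytes per token) are illustrative assumptions, not figures from the paper. The point is only that the same data maps to different token counts depending on the tokenizer.

```python
def tokens_for_bytes(num_bytes: int, bytes_per_token: float) -> float:
    """Number of tokens needed to cover num_bytes at a given compression rate."""
    return num_bytes / bytes_per_token


# Hypothetical values chosen for illustration only.
corpus_bytes = 1_000_000_000_000  # 1e12 bytes of training text
rates = {"coarse": 8.0, "fine": 4.0}  # assumed bytes per token

token_counts = {
    name: tokens_for_bytes(corpus_bytes, rate) for name, rate in rates.items()
}

for name, count in token_counts.items():
    print(f"{name}: {count:.3e} tokens for {corpus_bytes:.3e} bytes")
```

Under the claim summarized above, the compute-optimal parameter count would track `corpus_bytes`, which is the same for both tokenizers, rather than either of the two differing token counts.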

Cited by

1 paper in Pith

Receipt and verification
First computed 2026-05-27T02:06:14.186783Z
Builder pith-number-builder-2026-05-17-v1
Signature Pith Ed25519 (pith-v1-2026-05) · public key
Schema pith-number/v1.0

Canonical hash

c2ba242f1d8edc15526b1e630f5c98d48a21f4d659242a900ec11432447ec65e

Aliases

arxiv: 2605.01188 · arxiv_version: 2605.01188v2 · doi: 10.48550/arxiv.2605.01188 · pith_short_12: YK5CILY5R3OB · pith_short_16: YK5CILY5R3OBKUTL · pith_short_8: YK5CILY5
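The short aliases listed here are prefixes of the full 26-character identifier. Judging from the values on this page, the identifier itself also coincides with the first 26 characters of the RFC 4648 Base32 encoding of the canonical hash. That relationship is an observation from this record, not a documented rule, so the sketch below should be read as a consistency check under that assumption. The function names are hypothetical.

```python
import base64

# Values copied from this record.
CANONICAL_HASH = "c2ba242f1d8edc15526b1e630f5c98d48a21f4d659242a900ec11432447ec65e"
PITH_NUMBER = "YK5CILY5R3OBKUTLDZRQ6XEY2S"


def base32_prefix(hex_digest: str, length: int = 26) -> str:
    """Base32-encode a hex digest and keep the first `length` characters.

    Assumption: this mirrors how the identifier relates to the hash;
    the page itself does not state the derivation.
    """
    digest = bytes.fromhex(hex_digest)
    return base64.b32encode(digest).decode("ascii")[:length]


def short_aliases(pith_number: str, lengths=(8, 12, 16)) -> dict:
    """Build the pith_short_N aliases as plain prefixes of the identifier."""
    return {f"pith_short_{n}": pith_number[:n] for n in lengths}


derived = base32_prefix(CANONICAL_HASH)
aliases = short_aliases(PITH_NUMBER)

print("derived identifier:", derived)
print("matches record:", derived == PITH_NUMBER)
print(aliases)
```

If the derivation holds for other records, a client could recompute the identifier from the hash alone instead of trusting the displayed value.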
Agent API
Verify this Pith Number yourself
curl -sH 'Accept: application/ld+json' https://pith.science/pith/YK5CILY5R3OBKUTLDZRQ6XEY2S \
  | jq -c '.canonical_record' \
  | python3 -c "import sys,json,hashlib; b=json.dumps(json.loads(sys.stdin.read()), sort_keys=True, separators=(',',':'), ensure_ascii=False).encode(); print(hashlib.sha256(b).hexdigest())"
# expect: c2ba242f1d8edc15526b1e630f5c98d48a21f4d659242a900ec11432447ec65e
Canonical record JSON
{
  "metadata": {
    "abstract_canon_sha256": "e7fe2f567d44a77bb36cdea77cec8567704cde4940c5a000ac85c1215ff375d4",
    "cross_cats_sorted": [],
    "license": "http://creativecommons.org/licenses/by/4.0/",
    "primary_cat": "cs.CL",
    "submitted_at": "2026-05-02T01:53:22Z",
    "title_canon_sha256": "2ba189736d0694a11dca2c844800d19f7f4888957472955e501cd91a3fcce290"
  },
  "schema_version": "1.0",
  "source": {
    "id": "2605.01188",
    "kind": "arxiv",
    "version": 2
  }
}
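The same canonicalization used in the shell one-liner can be applied offline to the record printed above. The following is a minimal sketch that mirrors that command's `json.dumps` settings (sorted keys, compact separators, no ASCII escaping) and hashes the UTF-8 bytes. The function name is hypothetical, and the digest has not been checked here against the canonical hash listed on this page.

```python
import hashlib
import json


def canonical_sha256(record: dict) -> str:
    """Serialize with sorted keys and compact separators, then SHA-256 the UTF-8 bytes."""
    canonical = json.dumps(
        record, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Canonical record as shown on this page.
record = {
    "metadata": {
        "abstract_canon_sha256": "e7fe2f567d44a77bb36cdea77cec8567704cde4940c5a000ac85c1215ff375d4",
        "cross_cats_sorted": [],
        "license": "http://creativecommons.org/licenses/by/4.0/",
        "primary_cat": "cs.CL",
        "submitted_at": "2026-05-02T01:53:22Z",
        "title_canon_sha256": "2ba189736d0694a11dca2c844800d19f7f4888957472955e501cd91a3fcce290",
    },
    "schema_version": "1.0",
    "source": {"id": "2605.01188", "kind": "arxiv", "version": 2},
}

digest = canonical_sha256(record)
print(digest)  # compare by eye with the canonical hash listed on this page
```

Because keys are sorted before hashing, the digest does not depend on the order in which the fields appear in the downloaded JSON.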