Compute Optimal Tokenization

Alisa Liu; Artidoro Pagnoni; Gargi Ghosh; Luke Zettlemoyer; Margaret Li; Mike Lewis; Sachin Mehta; Srini Iyer; Tomasz Limisiewicz

arxiv: 2605.01188 · v2 · pith:YK5CILY5new · submitted 2026-05-02 · 💻 cs.CL

Tomasz Limisiewicz, Artidoro Pagnoni, Srini Iyer, Mike Lewis, Sachin Mehta, Alisa Liu, Margaret Li, Gargi Ghosh, Luke Zettlemoyer

Pith reviewed 2026-05-09 15:23 UTC · model grok-4.3

classification 💻 cs.CL

keywords scaling laws · tokenization · compute optimal · language models · compression rate · data efficiency · BPE

The pith

In compute-optimal regimes, language model parameter counts scale with the byte volume of data rather than the number of tokens.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The work tests how the average bytes of text per token shapes the relationships among model size, data volume, and total compute. By training nearly a thousand models with adjustable compression rates, the authors isolate the effect of token granularity on scaling behavior. They find that the optimal number of parameters grows linearly with bytes of training data, overturning the token-based scaling rule that has guided recent model design. The best compression rate itself falls as compute budgets rise, so larger training runs favor finer tokens. These patterns appear across both the experimental tokenizer and standard subword methods, and they hold for languages beyond English.

Core claim

In compute-optimal configurations, model parameter counts scale proportionally to data size measured in bytes, not in tokens as commonly perceived. The optimal compression rate differs from the one obtained with BPE and decreases with compute. These findings generalize to both latent and subword tokenization as well as to languages other than English.
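
The difference between the two scaling rules can be made concrete with a small sketch. The token count of a corpus is its byte count divided by the compression rate, so a rule tied to tokens prescribes a different model size whenever the compression rate changes, while a rule tied to bytes does not. The constant `PARAMS_PER_TOKEN` below is an arbitrary placeholder and not a value from the paper; only the 4.57 bytes-per-token BPE rate comes from the abstract.

```python
BPE_BYTES_PER_TOKEN = 4.57  # BPE compression rate quoted in the abstract

# ASSUMPTION: arbitrary illustrative constant, not taken from the paper.
PARAMS_PER_TOKEN = 0.05
# Calibrated so that both rules agree at the BPE compression rate.
PARAMS_PER_BYTE = PARAMS_PER_TOKEN / BPE_BYTES_PER_TOKEN


def tokens_from_bytes(num_bytes, bytes_per_token):
    """Number of tokens needed to cover num_bytes at a given compression rate."""
    return num_bytes / bytes_per_token


def params_token_rule(num_bytes, bytes_per_token):
    """Model size if parameters were proportional to the token count."""
    return PARAMS_PER_TOKEN * tokens_from_bytes(num_bytes, bytes_per_token)


def params_byte_rule(num_bytes):
    """Model size if parameters were proportional to the byte count."""
    return PARAMS_PER_BYTE * num_bytes


corpus_bytes = 1e11  # hypothetical corpus size
for rate in (2.0, BPE_BYTES_PER_TOKEN, 8.0):
    print(
        f"bytes/token={rate:5.2f}  "
        f"token rule: {params_token_rule(corpus_bytes, rate):.3e}  "
        f"byte rule: {params_byte_rule(corpus_bytes):.3e}"
    )
```

The two rules coincide only at the calibration point. At coarser tokens (8 bytes per token) the token rule prescribes a smaller model for the same text, and at finer tokens a larger one, which is the disagreement the paper's experiments are designed to resolve.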

What carries the argument

The compression rate (average bytes of text per token), varied continuously by training latent-tokenized BLT models that decouple token granularity from the language model itself.
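
As a minimal sketch of the quantity itself, the compression rate can be computed as UTF-8 bytes divided by token count. The whitespace split below is a stand-in used only for illustration; it is neither BPE nor the BLT latent tokenizer.

```python
def compression_rate(text: str, num_tokens: int) -> float:
    """Average number of UTF-8 bytes of text per token."""
    if num_tokens <= 0:
        raise ValueError("num_tokens must be positive")
    return len(text.encode("utf-8")) / num_tokens


# Hypothetical whitespace "tokenizer", used only to obtain a token count.
sample = "Scaling laws relate model size, data, and compute."
tokens = sample.split()
rate = compression_rate(sample, len(tokens))
print(f"{len(sample.encode('utf-8'))} bytes / {len(tokens)} tokens = {rate} bytes per token")
```

Counting bytes rather than characters matters for non-English text, where a single character can occupy several bytes.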

Load-bearing premise

That the scaling patterns observed with controllable latent tokenization will hold for ordinary subword tokenizers at scales beyond the 7B-parameter experiments.

What would settle it

Train matched sets of models with a fixed BPE tokenizer across a range of sizes and data volumes, then check whether the parameter count that minimizes loss per byte continues to follow the reported linear relationship.
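
One way to run the final check is to fit a power law to the compute-optimal points in log-log space and inspect the exponent; a value near 1 would be consistent with proportional scaling in bytes. The sketch below uses ordinary least squares and synthetic numbers generated to be exactly proportional, so it demonstrates the fitting step only. It is an assumed procedure, not the authors' method, and the data are not results from the paper.

```python
import math


def fit_power_law(xs, ys):
    """Fit y = coeff * x**exponent by least squares on (log x, log y)."""
    log_x = [math.log(x) for x in xs]
    log_y = [math.log(y) for y in ys]
    n = len(log_x)
    mean_x = sum(log_x) / n
    mean_y = sum(log_y) / n
    sxx = sum((a - mean_x) ** 2 for a in log_x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(log_x, log_y))
    exponent = sxy / sxx
    coeff = math.exp(mean_y - exponent * mean_x)
    return exponent, coeff


# SYNTHETIC placeholder data: optimal parameter count set to 0.01 * bytes.
bytes_seen = [1e9, 1e10, 1e11, 1e12]
optimal_params = [0.01 * b for b in bytes_seen]

exponent, coeff = fit_power_law(bytes_seen, optimal_params)
print(f"fitted exponent: {exponent:.3f}, coefficient: {coeff:.4f}")
```

With real runs, the same fit against token counts instead of byte counts, repeated across tokenizers, would show which unit gives an exponent that stays stable as the compression rate changes.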

Original abstract

Scaling laws enable the optimal selection of data amount and language model size, yet the impact of the data unit, the token, on this relationship remains underexplored. In this work, we systematically investigate how the information granularity of tokens, controlled by the compression rate (i.e., average bytes of text per token), affects scaling trends. We train 988 latent tokenized models (BLT) ranging from 50M to 7B parameters that enable setting the desired compression rate. This flexibility allows us to study the role of compression rate well beyond 4.57 bytes per token obtained with a popular BPE tokenizer. Our experiments reveal that in compute-optimal configurations, model parameter counts scale proportionally to data size measured in bytes, not in tokens as commonly perceived (Kaplan et al., 2020; Hoffmann et al., 2022). Furthermore, we discover that the optimal compression rate differs from the one obtained with BPE and decreases with compute. These findings generalize to both latent and subword tokenization, as well as to languages other than English, guiding language model developers on tokenization scheme selection for maximal compute efficiency.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper finds that compute-optimal model size scales with bytes of data rather than tokens, and that the best compression rate falls as compute grows, based on nearly a thousand BLT runs.

The central claim is that in optimal training, parameter count tracks bytes of data, not token count, and that the best token compression rate gets lower as you scale compute. They reach this by training 988 BLT models from 50M to 7B parameters where compression rate can be set independently of the core transformer. That flexibility lets them test rates well above and below standard BPE, map the scaling surface, and check non-English data as well.

The volume of runs and the direct manipulation of information density per token are the parts that actually move the needle beyond prior scaling-law work. The decreasing optimal compression trend is a practical takeaway for anyone picking a tokenizer for large runs.

The soft spot is the transfer from BLT to ordinary fixed subword tokenizers. The abstract states the byte-scaling and compression trends hold for subword cases too, yet gives no counts, ranges, or controls for those runs. If the relation depends on the variable-latent design inside BLT, it may not apply to the BPE pipelines used in practice. The 7B ceiling also leaves open whether the pattern continues at larger scale.

This work is aimed at people who train or tune large models and need concrete rules for tokenizer choice. A reader who cares about scaling laws or efficiency will find usable empirical maps even if they treat the subword generalization as needing more data. It deserves peer review because the experimental scale is real and the questions are actionable; a referee can check the subword controls and see how far the byte proportionality actually travels.

Referee Report

3 major / 2 minor

Summary. The paper trains 988 BLT (latent tokenized) models from 50M to 7B parameters with controllable compression rates to study how token granularity affects scaling. It claims that compute-optimal model size scales linearly with data measured in bytes rather than tokens, that optimal compression rate decreases with compute budget, and that these relations generalize to standard subword tokenizers (e.g., BPE) and non-English languages.

Significance. If the byte-based scaling relation holds beyond the BLT testbed, the work would revise the token-centric assumptions in Kaplan et al. (2020) and Hoffmann et al. (2022), offering practical guidance on tokenizer selection. The scale of 988 models across a wide compression range is a clear empirical strength, enabling direct observation of compression effects that fixed-vocabulary pipelines cannot easily isolate.

major comments (3)

[Abstract] Abstract: The central claim that 'these findings generalize to both latent and subword tokenization' is asserted without any reported model counts, parameter ranges, compression rates tested, or controls for the subword experiments. Because the headline revision to N ∝ bytes rests on this transfer from BLT (which learns compression inside the forward pass) to fixed BPE pipelines, the absence of these details makes the load-bearing generalization impossible to assess.
[Methods / Experimental setup] Experimental setup (implied by the 988-model study): No information is given on the procedure used to identify compute-optimal configurations, the functional form fitted to the scaling data, baseline comparisons, or statistical tests for the reported trends. With results drawn from 988 runs, the lack of these controls prevents evaluation of whether the byte proportionality is robust or sensitive to post-hoc choices.
[Results] Results on optimal compression: The claim that optimal compression rate 'decreases with compute' is presented as a discovery, yet no equation, table, or figure reference shows the fitted dependence or its uncertainty; without this, it is unclear whether the trend is independent of the BLT architecture or an artifact of the latent-token design.

minor comments (2)

[Abstract] The abstract states generalization 'to languages other than English' but supplies no supporting counts or figures; a brief summary table or sentence in the main text would clarify the scope.
[Introduction] Notation for 'compression rate' (bytes per token) is introduced without an early formal definition or relation to standard BPE metrics; adding this would aid readers unfamiliar with BLT.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their thoughtful and constructive comments. We address each major comment below and will revise the manuscript to improve clarity and completeness where the concerns are valid.

Referee: [Abstract] Abstract: The central claim that 'these findings generalize to both latent and subword tokenization' is asserted without any reported model counts, parameter ranges, compression rates tested, or controls for the subword experiments. Because the headline revision to N ∝ bytes rests on this transfer from BLT (which learns compression inside the forward pass) to fixed BPE pipelines, the absence of these details makes the load-bearing generalization impossible to assess.

Authors: We agree that the abstract would be strengthened by including specific details on the subword experiments. The main text reports additional runs with fixed BPE tokenizers to support the generalization claim, but these were not quantified in the abstract. In the revision we will update the abstract to state the number of subword models, their parameter ranges, and the compression rates tested, making the transfer from BLT to standard pipelines explicit and easier to evaluate.

revision: yes

Referee: [Methods / Experimental setup] Experimental setup (implied by the 988-model study): No information is given on the procedure used to identify compute-optimal configurations, the functional form fitted to the scaling data, baseline comparisons, or statistical tests for the reported trends. With results drawn from 988 runs, the lack of these controls prevents evaluation of whether the byte proportionality is robust or sensitive to post-hoc choices.

Authors: The manuscript outlines the 988-model study and scaling analysis, but we acknowledge that the exact procedure for selecting compute-optimal points, the functional form of the fitted relations, baseline comparisons, and statistical tests are not described with sufficient precision. We will revise the methods section to add these details, including the specific scaling-law equation, how optimal configurations were identified from the runs, and any robustness checks performed.

revision: yes

Referee: [Results] Results on optimal compression: The claim that optimal compression rate 'decreases with compute' is presented as a discovery, yet no equation, table, or figure reference shows the fitted dependence or its uncertainty; without this, it is unclear whether the trend is independent of the BLT architecture or an artifact of the latent-token design.

Authors: The decreasing trend in optimal compression rate with compute is illustrated via plots of optimal configurations across compute budgets in the results section. We agree that an explicit functional form and uncertainty quantification would make the claim more rigorous and help address whether it is architecture-specific. In the revision we will add the fitted dependence (including the equation and uncertainty) to the text, reference it from the relevant figure, and discuss its implications for the BLT design.

revision: yes
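
The selection step discussed in the second and third exchanges can be sketched as picking, for each compute budget, the run with the lowest loss and reading off its compression rate. This is an assumed procedure for illustration, not the one used in the paper, and every record below is a made-up placeholder rather than a reported result.

```python
# SYNTHETIC placeholder records, NOT results from the paper:
# (compute budget in FLOPs, bytes per token, parameter count, loss in bits per byte)
runs = [
    (1e19, 4.0, 1e8, 1.10),
    (1e19, 6.0, 1e8, 1.05),
    (1e19, 8.0, 1e8, 1.08),
    (1e21, 4.0, 1e9, 0.90),
    (1e21, 6.0, 1e9, 0.93),
    (1e21, 8.0, 1e9, 0.97),
]


def best_run_per_budget(records):
    """Return {budget: (bytes_per_token, params, loss)} for the lowest-loss run."""
    best = {}
    for budget, rate, params, loss in records:
        if budget not in best or loss < best[budget][2]:
            best[budget] = (rate, params, loss)
    return best


best = best_run_per_budget(runs)
for budget in sorted(best):
    rate, params, loss = best[budget]
    print(f"budget={budget:.0e}  best bytes/token={rate}  params={params:.0e}  loss={loss}")
```

Reporting the selected rate per budget together with an uncertainty estimate, for example from repeated seeds, is the kind of detail the referee's third comment asks for.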

Circularity Check

0 steps flagged

No significant circularity: empirical observations from model training

The paper is a purely empirical study that trains 988 BLT models (50M–7B parameters) with controllable compression rates and directly observes scaling trends in compute-optimal regimes. The central claims, that optimal parameter count scales with bytes rather than tokens and that optimal compression decreases with compute, are presented as experimental findings rather than derived via a mathematical chain, a fitted parameter renamed as a prediction, or a self-referential definition. No equations or derivations are shown that reduce the reported relations to their inputs by construction, and the cited prior work (Kaplan et al., Hoffmann et al.) is external rather than a load-bearing self-citation. The generalization to subword tokenizers is stated as an extension of the observed trends and involves no circular reduction. The claims therefore rest directly on the experimental data rather than on any circular argument.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Central claim rests on empirical outcomes from training 988 BLT models across compression rates; no explicit free parameters, axioms, or invented entities are detailed in the abstract.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Executable Boundary Contracts for Sound Event Traces
cs.LO 2026-05 unverdicted novelty 6.0 partial

Defines executable boundary contracts for sound event traces using an STL-embeddable Boolean fragment plus interval and duration clauses, then evaluates them on speech and soundscape data where they disagree with stan...