Compute Optimal Tokenization
Pith reviewed 2026-05-09 15:23 UTC · model grok-4.3
The pith
In compute-optimal regimes, language model parameter counts scale with the byte volume of data rather than the number of tokens.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
In compute-optimal configurations, model parameter counts scale proportionally to data size measured in bytes, not in tokens as commonly perceived. The optimal compression rate differs from the one obtained with BPE and decreases with compute. These findings generalize to both latent and subword tokenization as well as to languages other than English.
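To make the distinction concrete, the short derivation below relates the two ways of measuring data. The symbols are assumptions introduced here for illustration and are not the paper's notation: $N_{\text{opt}}$ is the compute-optimal parameter count, $D_{\text{tok}}$ and $D_{\text{byte}}$ are the training-set size in tokens and in bytes, and $r$ is the compression rate in bytes per token.

```latex
D_{\text{byte}} = r \, D_{\text{tok}}
\qquad\text{so}\qquad
N_{\text{opt}} \propto D_{\text{byte}}
\;\;\Longrightarrow\;\;
\frac{D_{\text{tok}}}{N_{\text{opt}}} \propto \frac{1}{r}
```

Read this way, and assuming the constant of proportionality does not itself depend on $r$, a tokens-per-parameter ratio would not be constant across tokenizers: it would shrink as tokens become coarser, while the bytes-per-parameter ratio would stay fixed.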
What carries the argument
The compression rate (average bytes of text per token), varied continuously by training latent-tokenized BLT models that decouple token granularity from the language model itself.
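As a minimal sketch of the quantity being varied, the snippet below computes bytes per token for one string. The function name `compression_rate` and the whitespace tokenizer are illustrative assumptions; the whitespace split is only a stand-in and is neither BPE nor BLT.

```python
def compression_rate(text: str, tokens: list[str]) -> float:
    """Average number of UTF-8 bytes of text per token."""
    if not tokens:
        raise ValueError("need at least one token")
    return len(text.encode("utf-8")) / len(tokens)


# Hypothetical stand-in tokenizer: whitespace splitting (NOT BPE or BLT).
sample = "scaling laws relate model size to data size"
tokens = sample.split()

rate = compression_rate(sample, tokens)
n_bytes = len(sample.encode("utf-8"))
print(f"{n_bytes} bytes / {len(tokens)} tokens = {rate:.3f} bytes per token")
```

Measuring in UTF-8 bytes rather than characters matters for non-English text, where a single character can occupy several bytes.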
Load-bearing premise
That the scaling patterns observed with controllable latent tokenization will hold for ordinary subword tokenizers at scales beyond the 7B-parameter experiments.
What would settle it
Train matched sets of models with a fixed BPE tokenizer across a range of sizes and data volumes, then check whether the parameter count that minimizes loss per byte continues to follow the reported linear relationship.
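One way such a check could be run, sketched under stated assumptions: fit a straight line to log parameter count against log training bytes and inspect the slope. The helper `loglog_slope` and all numbers below are hypothetical placeholders constructed to be exactly proportional; they are not results from the paper.

```python
import math


def loglog_slope(xs: list[float], ys: list[float]) -> float:
    """Ordinary least-squares slope of log(y) against log(x)."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    mx = sum(lx) / len(lx)
    my = sum(ly) / len(ly)
    cov = sum((a - mx) * (b - my) for a, b in zip(lx, ly))
    var = sum((a - mx) ** 2 for a in lx)
    return cov / var


# Hypothetical placeholder values, NOT results from the paper:
# training bytes per budget and the loss-minimizing parameter count.
train_bytes = [1e10, 4e10, 1.6e11, 6.4e11]
n_opt = [1e8, 4e8, 1.6e9, 6.4e9]  # built to be exactly proportional

slope = loglog_slope(train_bytes, n_opt)
print(f"log-log slope: {slope:.3f}")
```

A slope close to 1 on real BPE runs would be consistent with the proportional relation; repeating the fit with tokens on the x-axis, across tokenizers with different compression rates, would indicate which unit gives the tighter single line.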
Original abstract
Scaling laws enable the optimal selection of data amount and language model size, yet the impact of the data unit, the token, on this relationship remains underexplored. In this work, we systematically investigate how the information granularity of tokens, controlled by the compression rate (i.e., average bytes of text per token), affects scaling trends. We train 988 latent tokenized models (BLT) ranging from 50M to 7B parameters that enable setting the desired compression rate. This flexibility allows us to study the role of compression rate well beyond 4.57 bytes per token obtained with a popular BPE tokenizer. Our experiments reveal that in compute-optimal configurations, model parameter counts scale proportionally to data size measured in bytes, not in tokens as commonly perceived (Kaplan et al., 2020; Hoffmann et al., 2022). Furthermore, we discover that the optimal compression rate differs from the one obtained with BPE and decreases with compute. These findings generalize to both latent and subword tokenization, as well as to languages other than English, guiding language model developers on tokenization scheme selection for maximal compute efficiency.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper trains 988 BLT (latent tokenized) models from 50M to 7B parameters with controllable compression rates to study how token granularity affects scaling. It claims that compute-optimal model size scales linearly with data measured in bytes rather than tokens, that optimal compression rate decreases with compute budget, and that these relations generalize to standard subword tokenizers (e.g., BPE) and non-English languages.
Significance. If the byte-based scaling relation holds beyond the BLT testbed, the work would revise the token-centric assumptions in Kaplan et al. (2020) and Hoffmann et al. (2022), offering practical guidance on tokenizer selection. The scale of 988 models across a wide compression range is a clear empirical strength, enabling direct observation of compression effects that fixed-vocabulary pipelines cannot easily isolate.
Major comments (3)
- [Abstract] Abstract: The central claim that 'these findings generalize to both latent and subword tokenization' is asserted without any reported model counts, parameter ranges, compression rates tested, or controls for the subword experiments. Because the headline revision to N ∝ bytes rests on this transfer from BLT (which learns compression inside the forward pass) to fixed BPE pipelines, the absence of these details makes the load-bearing generalization impossible to assess.
- [Methods / Experimental setup] Experimental setup (implied by the 988-model study): No information is given on the procedure used to identify compute-optimal configurations, the functional form fitted to the scaling data, baseline comparisons, or statistical tests for the reported trends. With results drawn from 988 runs, the lack of these controls prevents evaluation of whether the byte proportionality is robust or sensitive to post-hoc choices.
- [Results] Results on optimal compression: The claim that optimal compression rate 'decreases with compute' is presented as a discovery, yet no equation, table, or figure reference shows the fitted dependence or its uncertainty; without this, it is unclear whether the trend is independent of the BLT architecture or an artifact of the latent-token design.
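The second major comment asks how compute-optimal configurations were identified. The review does not say which procedure the paper uses, so the following is only a sketch of one common approach: group runs by compute budget and keep the run with the lowest loss per byte. The `Run` fields, the helper names, and every number are hypothetical.

```python
import math
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Run:
    flops: float            # training compute budget
    params: float           # model parameter count
    bytes_per_token: float  # compression rate of the run
    loss_per_token: float   # cross-entropy in nats per token


def bits_per_byte(run: Run) -> float:
    """Convert per-token loss in nats to bits per byte of text."""
    return run.loss_per_token / (run.bytes_per_token * math.log(2))


def compute_optimal(runs: list[Run]) -> dict[float, Run]:
    """For each compute budget, keep the run with the lowest bits per byte."""
    by_budget: dict[float, list[Run]] = defaultdict(list)
    for run in runs:
        by_budget[run.flops].append(run)
    return {c: min(group, key=bits_per_byte) for c, group in by_budget.items()}


# Hypothetical runs, NOT data from the paper.
runs = [
    Run(1e18, 5e7, 4.0, 4.40),
    Run(1e18, 1e8, 4.0, 4.20),
    Run(1e18, 1e8, 8.0, 8.80),
    Run(1e19, 2e8, 4.0, 3.90),
    Run(1e19, 4e8, 6.0, 5.70),
]

best = compute_optimal(runs)
for budget, run in sorted(best.items()):
    print(f"{budget:.0e} FLOPs -> {run.params:.0e} params, "
          f"{run.bytes_per_token} bytes/token, {bits_per_byte(run):.3f} bits/byte")
```

Normalizing by bytes is the design choice that matters here: per-token losses are not comparable across compression rates, because a token at a higher rate covers more text.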
Minor comments (2)
- [Abstract] The abstract states generalization 'to languages other than English' but supplies no supporting counts or figures; a brief summary table or sentence in the main text would clarify the scope.
- [Introduction] Notation for 'compression rate' (bytes per token) is introduced without an early formal definition or relation to standard BPE metrics; adding this would aid readers unfamiliar with BLT.
Simulated Author's Rebuttal
We thank the referee for their thoughtful and constructive comments. We address each major comment below and will revise the manuscript to improve clarity and completeness where the concerns are valid.
Point-by-point responses
- Referee: [Abstract] Abstract: The central claim that 'these findings generalize to both latent and subword tokenization' is asserted without any reported model counts, parameter ranges, compression rates tested, or controls for the subword experiments. Because the headline revision to N ∝ bytes rests on this transfer from BLT (which learns compression inside the forward pass) to fixed BPE pipelines, the absence of these details makes the load-bearing generalization impossible to assess.
  Authors: We agree that the abstract would be strengthened by including specific details on the subword experiments. The main text reports additional runs with fixed BPE tokenizers to support the generalization claim, but these were not quantified in the abstract. In the revision we will update the abstract to state the number of subword models, their parameter ranges, and the compression rates tested, making the transfer from BLT to standard pipelines explicit and easier to evaluate.
  Revision: yes
- Referee: [Methods / Experimental setup] Experimental setup (implied by the 988-model study): No information is given on the procedure used to identify compute-optimal configurations, the functional form fitted to the scaling data, baseline comparisons, or statistical tests for the reported trends. With results drawn from 988 runs, the lack of these controls prevents evaluation of whether the byte proportionality is robust or sensitive to post-hoc choices.
  Authors: The manuscript outlines the 988-model study and scaling analysis, but we acknowledge that the exact procedure for selecting compute-optimal points, the functional form of the fitted relations, baseline comparisons, and statistical tests are not described with sufficient precision. We will revise the methods section to add these details, including the specific scaling-law equation, how optimal configurations were identified from the runs, and any robustness checks performed.
  Revision: yes
- Referee: [Results] Results on optimal compression: The claim that optimal compression rate 'decreases with compute' is presented as a discovery, yet no equation, table, or figure reference shows the fitted dependence or its uncertainty; without this, it is unclear whether the trend is independent of the BLT architecture or an artifact of the latent-token design.
  Authors: The decreasing trend in optimal compression rate with compute is illustrated via plots of optimal configurations across compute budgets in the results section. We agree that an explicit functional form and uncertainty quantification would make the claim more rigorous and help address whether it is architecture-specific. In the revision we will add the fitted dependence (including the equation and uncertainty) to the text, reference it from the relevant figure, and discuss its implications for the BLT design.
  Revision: yes
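The kind of fitted dependence with uncertainty that the referee requests could look roughly like the sketch below: a power-law exponent for optimal compression rate against compute, with a percentile bootstrap interval. The power-law form is an assumption made here, not something the review attributes to the paper, and the data points are hypothetical placeholders rather than reported values.

```python
import math
import random


def ols_slope(xs, ys):
    """Least-squares slope of ys against xs."""
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var = sum((x - mx) ** 2 for x in xs)
    return cov / var


def bootstrap_slope_ci(xs, ys, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the slope, resampling (x, y) pairs."""
    rng = random.Random(seed)
    n = len(xs)
    slopes = []
    while len(slopes) < n_boot:
        picks = [rng.randrange(n) for _ in range(n)]
        if len(set(picks)) < 2:
            continue  # a slope needs at least two distinct points
        slopes.append(ols_slope([xs[i] for i in picks], [ys[i] for i in picks]))
    slopes.sort()
    lower = slopes[int(alpha / 2 * n_boot)]
    upper = slopes[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper


# Hypothetical placeholder values, NOT results from the paper.
compute = [1e18, 3e18, 1e19, 3e19, 1e20, 3e20]  # FLOPs
r_opt = [7.9, 7.1, 6.6, 5.8, 5.5, 4.9]          # bytes per token

log_c = [math.log(c) for c in compute]
log_r = [math.log(r) for r in r_opt]

point = ols_slope(log_c, log_r)
lower, upper = bootstrap_slope_ci(log_c, log_r)
print(f"exponent: {point:.3f}, 95% interval: [{lower:.3f}, {upper:.3f}]")
```

An interval that excludes zero would support a decreasing trend within one architecture; whether the trend is specific to BLT would still require the same fit on subword runs.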
Circularity Check
No significant circularity: empirical observations from model training
Full rationale
The paper is a purely empirical study that trains 988 BLT models (50M–7B parameters) with controllable compression rates and directly observes scaling trends in compute-optimal regimes. The central claims (that optimal parameter count scales with bytes rather than tokens, and that optimal compression rate decreases with compute) are presented as experimental findings rather than derived through a mathematical chain, a fitted parameter renamed as a prediction, or a self-referential definition. No equations or derivations are shown that reduce the reported relations to their inputs by construction, and the cited prior work (Kaplan et al., Hoffmann et al.) is external rather than a load-bearing self-citation. The generalization to subword tokenizers is stated as an extension of the observed trends and involves no circular reduction. The claims therefore rest on the experimental data alone.
Forward citations
Cited by 1 Pith paper
- Executable Boundary Contracts for Sound Event Traces
Defines executable boundary contracts for sound event traces using an STL-embeddable Boolean fragment plus interval and duration clauses, then evaluates them on speech and soundscape data where they disagree with stan...