Compute Optimal Tokenization
Pith reviewed 2026-05-09 15:23 UTC · model grok-4.3
The pith
In compute-optimal regimes, language model parameter counts scale with the byte volume of data rather than the number of tokens.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
In compute-optimal configurations, model parameter counts scale proportionally to data size measured in bytes, not in tokens as commonly perceived. The optimal compression rate differs from the one obtained with BPE and decreases with compute. These findings generalize to both latent and subword tokenization as well as to languages other than English.
What carries the argument
The compression rate (average bytes of text per token), which the authors vary continuously by training latent-tokenized BLT models that decouple token granularity from the language model itself.
Load-bearing premise
That the scaling patterns observed with controllable latent tokenization will hold for ordinary subword tokenizers at scales beyond the 7B-parameter experiments.
What would settle it
Train matched sets of models with a fixed BPE tokenizer across a range of sizes and data volumes, then check whether the parameter count that minimizes loss per byte continues to follow the reported linear relationship.
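The proposed check can be sketched in a few lines: given (data-in-bytes, loss-minimizing parameter count) pairs from such matched BPE runs, fit a power law N = a · bytes^b in log-log space and see whether the exponent b stays near 1. The data points below are synthetic placeholders, not values from the paper.

```python
import numpy as np

# Synthetic placeholder measurements: bytes of training data and the
# parameter count that minimized loss-per-byte at each data scale.
data_bytes = np.array([2e10, 8e10, 3.2e11, 1.28e12])
optimal_n = np.array([1e8, 4e8, 1.6e9, 6.4e9])

# Fit log N = b * log(bytes) + log(a); b near 1 would support N ∝ bytes.
b, log_a = np.polyfit(np.log(data_bytes), np.log(optimal_n), 1)
print(f"fitted exponent b = {b:.3f}")
```

The same fit run against token counts instead of byte counts would let the two proportionality claims be compared directly.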
read the original abstract
Scaling laws enable the optimal selection of data amount and language model size, yet the impact of the data unit, the token, on this relationship remains underexplored. In this work, we systematically investigate how the information granularity of tokens, controlled by the compression rate (i.e., average bytes of text per token), affects scaling trends. We train 988 latent tokenized models (BLT) ranging from 50M to 7B parameters that enable setting the desired compression rate. This flexibility allows us to study the role of compression rate well beyond 4.57 bytes per token obtained with a popular BPE tokenizer. Our experiments reveal that in compute-optimal configurations, model parameter counts scale proportionally to data size measured in bytes, not in tokens as commonly perceived (Kaplan et al., 2020; Hoffmann et al., 2022). Furthermore, we discover that the optimal compression rate differs from the one obtained with BPE and decreases with compute. These findings generalize to both latent and subword tokenization, as well as to languages other than English, guiding language model developers on tokenization scheme selection for maximal compute efficiency.
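For readers unfamiliar with the metric, the compression rate the abstract refers to is simply UTF-8 bytes of text divided by the number of tokens. A minimal sketch, using a whitespace split as a stand-in tokenizer (the 4.57 figure in the abstract comes from an actual BPE tokenizer, not this toy):

```python
def compression_rate(text: str, tokens: list[str]) -> float:
    """Average bytes of text carried by each token."""
    return len(text.encode("utf-8")) / len(tokens)

text = "Scaling laws enable optimal model sizing"
tokens = text.split()  # 6 whitespace "tokens"; a real system would use BPE or BLT
print(f"{compression_rate(text, tokens):.2f} bytes/token")  # → 6.67 bytes/token
```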
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper trains 988 BLT (latent tokenized) models from 50M to 7B parameters with controllable compression rates to study how token granularity affects scaling. It claims that compute-optimal model size scales linearly with data measured in bytes rather than tokens, that optimal compression rate decreases with compute budget, and that these relations generalize to standard subword tokenizers (e.g., BPE) and non-English languages.
Significance. If the byte-based scaling relation holds beyond the BLT testbed, the work would revise the token-centric assumptions in Kaplan et al. (2020) and Hoffmann et al. (2022), offering practical guidance on tokenizer selection. The scale of 988 models across a wide compression range is a clear empirical strength, enabling direct observation of compression effects that fixed-vocabulary pipelines cannot easily isolate.
major comments (3)
- [Abstract] Abstract: The central claim that 'these findings generalize to both latent and subword tokenization' is asserted without any reported model counts, parameter ranges, compression rates tested, or controls for the subword experiments. Because the headline revision to N ∝ bytes rests on this transfer from BLT (which learns compression inside the forward pass) to fixed BPE pipelines, the absence of these details makes the load-bearing generalization impossible to assess.
- [Methods / Experimental setup] Experimental setup (implied by the 988-model study): No information is given on the procedure used to identify compute-optimal configurations, the functional form fitted to the scaling data, baseline comparisons, or statistical tests for the reported trends. With results drawn from 988 runs, the lack of these controls prevents evaluation of whether the byte proportionality is robust or sensitive to post-hoc choices.
- [Results] Results on optimal compression: The claim that optimal compression rate 'decreases with compute' is presented as a discovery, yet no equation, table, or figure reference shows the fitted dependence or its uncertainty; without this, it is unclear whether the trend is independent of the BLT architecture or an artifact of the latent-token design.
minor comments (2)
- [Abstract] The abstract states generalization 'to languages other than English' but supplies no supporting counts or figures; a brief summary table or sentence in the main text would clarify the scope.
- [Introduction] Notation for 'compression rate' (bytes per token) is introduced without an early formal definition or relation to standard BPE metrics; adding this would aid readers unfamiliar with BLT.
Simulated Author's Rebuttal
We thank the referee for their thoughtful and constructive comments. We address each major comment below and will revise the manuscript to improve clarity and completeness where the concerns are valid.
read point-by-point responses
-
Referee: [Abstract] Abstract: The central claim that 'these findings generalize to both latent and subword tokenization' is asserted without any reported model counts, parameter ranges, compression rates tested, or controls for the subword experiments. Because the headline revision to N ∝ bytes rests on this transfer from BLT (which learns compression inside the forward pass) to fixed BPE pipelines, the absence of these details makes the load-bearing generalization impossible to assess.
Authors: We agree that the abstract would be strengthened by including specific details on the subword experiments. The main text reports additional runs with fixed BPE tokenizers to support the generalization claim, but these were not quantified in the abstract. In the revision we will update the abstract to state the number of subword models, their parameter ranges, and the compression rates tested, making the transfer from BLT to standard pipelines explicit and easier to evaluate. revision: yes
-
Referee: [Methods / Experimental setup] Experimental setup (implied by the 988-model study): No information is given on the procedure used to identify compute-optimal configurations, the functional form fitted to the scaling data, baseline comparisons, or statistical tests for the reported trends. With results drawn from 988 runs, the lack of these controls prevents evaluation of whether the byte proportionality is robust or sensitive to post-hoc choices.
Authors: The manuscript outlines the 988-model study and scaling analysis, but we acknowledge that the exact procedure for selecting compute-optimal points, the functional form of the fitted relations, baseline comparisons, and statistical tests are not described with sufficient precision. We will revise the methods section to add these details, including the specific scaling-law equation, how optimal configurations were identified from the runs, and any robustness checks performed. revision: yes
-
Referee: [Results] Results on optimal compression: The claim that optimal compression rate 'decreases with compute' is presented as a discovery, yet no equation, table, or figure reference shows the fitted dependence or its uncertainty; without this, it is unclear whether the trend is independent of the BLT architecture or an artifact of the latent-token design.
Authors: The decreasing trend in optimal compression rate with compute is illustrated via plots of optimal configurations across compute budgets in the results section. We agree that an explicit functional form and uncertainty quantification would make the claim more rigorous and help address whether it is architecture-specific. In the revision we will add the fitted dependence (including the equation and uncertainty) to the text, reference it from the relevant figure, and discuss its implications for the BLT design. revision: yes
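A sketch of the kind of fitting pipeline the rebuttal promises to document. The Chinchilla-style form L(N, D) = E + A/N^α + B/D^β and every number below are assumptions for illustration (the paper's reference list cites L-BFGS-B, which hints at a fit of roughly this shape), not the paper's actual procedure.

```python
import numpy as np
from scipy.optimize import minimize

# Synthetic "runs": model sizes N, data sizes D, and losses generated
# from an assumed Chinchilla-style surface L = E + A/N^a + B/D^b.
rng = np.random.default_rng(0)
N = rng.uniform(5e7, 7e9, 40)
D = rng.uniform(1e9, 1e12, 40)
loss = 1.7 + 400.0 / N**0.34 + 410.0 / D**0.28

def objective(p):
    # Parametrize A and B through logs to keep them positive.
    E, logA, alpha, logB, beta = p
    pred = E + np.exp(logA) / N**alpha + np.exp(logB) / D**beta
    return float(np.mean((pred - loss) ** 2))

x0 = [2.0, 5.0, 0.3, 5.0, 0.3]
res = minimize(objective, x0, method="L-BFGS-B")
print(f"fitted irreducible loss E = {res.x[0]:.3f}, residual = {res.fun:.2e}")
```

Compute-optimal configurations would then follow by minimizing the fitted L subject to a FLOP constraint (e.g. C ≈ 6ND), and robustness checks would refit under held-out runs or alternative initializations.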
Circularity Check
No significant circularity: empirical observations from model training
full rationale
The paper is a purely empirical study that trains 988 BLT models (50M–7B parameters) with controllable compression rates and directly observes scaling trends in compute-optimal regimes. The central claims (that optimal parameter count scales with bytes rather than tokens, and that optimal compression decreases with compute) are presented as experimental findings rather than derived via a mathematical chain, a fitted parameter renamed as a prediction, or a self-referential definition. No equations or derivations are shown that reduce the reported relations to their inputs by construction, and the cited prior work (Kaplan et al., Hoffmann et al.) is external rather than a load-bearing self-citation. The generalization to subword tokenizers is stated as an extension of the observed trends, not as a circular reduction. The claims therefore rest directly on the experimental data rather than on any internal derivation.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
-
[1]
Scaling Laws for Neural Language Models
Kaplan, Jared and McCandlish, Sam and Henighan, Tom and Brown, Tom B. and Chess, Benjamin and Child, Rewon and Gray, Scott and Radford, Alec and Wu, Jeffrey and Amodei, Dario. Scaling Laws for Neural Language Models. arXiv preprint arXiv:2001.08361. 2020
-
[2]
Training Compute-Optimal Large Language Models
Hoffmann, Jordan and Borgeaud, Sebastian and Mensch, Arthur and Buchatskaya, Elena and Cai, Trevor and Rutherford, Eliza and de Las Casas, Diego and Hendricks, Lisa Anne and Welbl, Johannes and Clark, Aidan and Hennigan, Tom and Noland, Eric and Millican, Katie and van den Driessche, George and Damoc, Bogdan and Guy, Aurelia and Osindero, Simon and Simony...
-
[3]
Resolving Discrepancies in Compute-Optimal Scaling of Language Models
Porian, Tomer and Wortsman, Mitchell and Jitsev, Jenia and Schmidt, Ludwig and Carmon, Yair. Resolving Discrepancies in Compute-Optimal Scaling of Language Models. doi:10.52202/079017-3189
-
[4]
Tim Pearce and Jinyeop Song. TMLR.
-
[5]
Scaling Laws with Vocabulary: Larger Models Deserve Larger Vocabularies
Tao, Chaofan and Liu, Qian and Dou, Longxu and Muennighoff, Niklas and Wan, Zhongwei and Luo, Ping and Lin, Min and Wong, Ngai. Scaling Laws with Vocabulary: Larger Models Deserve Larger Vocabularies. The Thirty-eighth Annual Conference on Neural Information Processing Systems. 2024
-
[6]
(Mis)Fitting Scaling Laws: A Survey of Scaling Law Fitting Techniques in Deep Learning
The Thirteenth International Conference on Learning Representations.
-
[7]
Scaling Laws for Code: Every Programming Language Matters
arXiv preprint. 2025
-
[8]
ATLAS: Adaptive Transfer Scaling Laws for Multilingual Pretraining, Finetuning, and Decoding the Curse of Multilinguality
arXiv preprint. 2026
-
[9]
Scaling Laws for Multilingual Language Models
He, Yifei and Benhaim, Alon and Patra, Barun and Vaddamanu, Praneetha and Ahuja, Sanchit and Chopra, Parul and Chaudhary, Vishrav and Zhao, Han and Song, Xia. Scaling Laws for Multilingual Language Models. Findings of the Association for Computational Linguistics: ACL 2025. 2025. doi:10.18653/v1/2025.findings-acl.221
-
[10]
Scaling Laws for Generative Mixed-Modal Language Models
Proceedings of the 40th International Conference on Machine Learning. 2023
-
[11]
DataDecide: How to Predict Best Pretraining Data with Small Experiments
arXiv preprint. 2025
-
[12]
On the Limited Memory BFGS Method for Large Scale Optimization
Liu, Dong C. and Nocedal, Jorge. On the Limited Memory BFGS Method for Large Scale Optimization. Math. Program. 1989
-
[13]
Algorithm 778: L-BFGS-B: Fortran subroutines for large-scale bound-constrained optimization
Zhu, Ciyou and Byrd, Richard H and Lu, Peihuang and Nocedal, Jorge. Algorithm 778: L-BFGS-B: Fortran subroutines for large-scale bound-constrained optimization. ACM Transactions on Mathematical Software (TOMS). 1997
-
[14]
DataComp-LM: In search of the next generation of training sets for language models
Li, Jeffrey and Fang, Alex and Smez, Georgios and Albalak, Alon and Mehta, Kaber and Openshaw, Etash and Haber, Louis and Wortsman, Mitchell and Keh, Sedrick and Gadre, Samir Yitzhak and Taori, Rohan and Tian, Shuran and Jitsev, Jenia and Ilharco, Gabriel and Smola, Alexander and Farhadi, Ali and Shankar, Vaishaal and Schmidt, Ludwig and Carmon, Yair and ...
-
[15]
FineWeb2: One Pipeline to Scale Them All -- Adapting Pre-Training Data Processing to Every Language
arXiv preprint. 2025
-
[16]
No Language Left Behind: Scaling Human-Centered Machine Translation
arXiv preprint. 2022
-
[17]
The Flores-101 Evaluation Benchmark for Low-Resource and Multilingual Machine Translation
Transactions of the Association for Computational Linguistics. 2022
-
[18]
Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer
Raffel, Colin and Shazeer, Noam and Roberts, Adam and Lee, Katherine and Narang, Sharan and Matena, Michael and Zhou, Yanqi and Li, Wei and Liu, Peter J. Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer. Journal of Machine Learning Research. 2020
-
[19]
The Stack: 3 TB of permissively licensed source code (2022)
Kocetkov, Denis and Li, Raymond and Allal, Loubna Ben and Li, Jia and Mou, Chenghao and Muñoz Ferrandis, Carlos and Jernite, Yacine and Mitchell, Margaret and Hughes, Sean and Wolf, Thomas and Baez, Dzmitry and Dao, Gravity and Mishra, Mayank and Gu, Alex and Dey, Brendan and Luccioni, Sasha and Vero, Stella Biderman and Muller, Benjamin and de Vries, Har...
-
[20]
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics , year=
HellaSwag: Can a Machine Really Finish Your Sentence? , author=. Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics , year=
-
[21]
Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge
Peter Clark and Isaac Cowhey and Oren Etzioni and Tushar Khot and Ashish Sabharwal and Carissa Schoenick and Oyvind Tafjord. Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge. arXiv:1803.05457. 2018
-
[22]
Byte Latent Transformer: Patches Scale Better Than Tokens
Pagnoni, Artidoro and Pasunuru, Ramakanth and Rodriguez, Pedro and Nguyen, John and Muller, Benjamin and Li, Margaret and Zhou, Chunting and Yu, Lili and Weston, Jason E and Zettlemoyer, Luke and Ghosh, Gargi and Lewis, Mike and Holtzman, Ari and Iyer, Srini. Byte Latent Transformer: Patches Scale Better Than Tokens. Proceedings of the 63rd Annual Meeting...
-
[23]
Dynamic Chunking for End-to-End Hierarchical Sequence Modeling
arXiv preprint. 2025
-
[24]
Hierarchical Autoregressive Transformers: Combining Byte- and Word-Level Processing for Robust, Adaptable Language Models
The Thirteenth International Conference on Learning Representations. 2025
-
[25]
Efficient Transformers with Dynamic Token Pooling
Nawrot, Piotr and Chorowski, Jan and Lancucki, Adrian and Ponti, Edoardo Maria. Efficient Transformers with Dynamic Token Pooling. Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2023. doi:10.18653/v1/2023.acl-long.353
-
[26]
SpaceByte: Towards Deleting Tokenization from Large Language Modeling
The Thirty-eighth Annual Conference on Neural Information Processing Systems. 2024
-
[27]
BPE-Dropout: Simple and Effective Subword Regularization
Provilkov, Ivan and Emelianenko, Dmitrii and Voita, Elena. BPE-Dropout: Simple and Effective Subword Regularization. Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. 2020. doi:10.18653/v1/2020.acl-main.170
-
[28]
How Good is Your Tokenizer? On the Monolingual Performance of Multilingual Language Models
Rust, Phillip and Pfeiffer, Jonas and Vulić, Ivan and Ruder, Sebastian and Gurevych, Iryna. How Good is Your Tokenizer? On the Monolingual Performance of Multilingual Language Models. Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volum...
-
[29]
Unpacking Tokenization: Evaluating Text Compression and its Correlation with Model Performance
Goldman, Omer and Caciularu, Avi and Eyal, Matan and Cao, Kris and Szpektor, Idan and Tsarfaty, Reut. Unpacking Tokenization: Evaluating Text Compression and its Correlation with Model Performance. Findings of the Association for Computational Linguistics: ACL 2024. 2024. doi:10.18653/v1/2024.findings-acl.134
-
[30]
Investigating the Effectiveness of BPE: The Power of Shorter Sequences
Gallé, Matthias. Investigating the Effectiveness of BPE: The Power of Shorter Sequences. Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). 2019. doi:10.18653/v1/D19-1141
-
[31]
Explaining and Mitigating Crosslingual Tokenizer Inequities
arXiv preprint.
-
[32]
Tokenization Is More Than Compression
Schmidt, Craig W and Reddy, Varshini and Zhang, Haoran and Alameddine, Alec and Uzan, Omri and Pinter, Yuval and Tanner, Chris. Tokenization Is More Than Compression. Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing. 2024. doi:10.18653/v1/2024.emnlp-main.40
-
[33]
Tokenizer Choice For LLM Training: Negligible or Crucial?
Ali, Mehdi and Fromm, Michael and Thellmann, Klaudia and Rutmann, Richard, et al. Tokenizer Choice For LLM Training: Negligible or Crucial?. Findings of the Association for Computational Linguistics: NAACL 2024. 2024. doi:10.18653/v1/2024.findings-naacl.247
-
[34]
From bytes to ideas: Language modeling with autoregressive u-nets
Mathurin Videau and Badr Youbi Idrissi and Alessandro Leite and Marc Schoenauer and Olivier Teytaud and David Lopez-Paz. From bytes to ideas: Language modeling with autoregressive u-nets. arXiv preprint arXiv:2506.14761. 2025
-
[35]
Neural Machine Translation of Rare Words with Subword Units
Sennrich, Rico and Haddow, Barry and Birch, Alexandra. Neural Machine Translation of Rare Words with Subword Units. Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2016. doi:10.18653/v1/P16-1162
-
[36]
SuperBPE: Space Travel for Language Models
Liu, Alisa and Hayase, Jonathan and Hofmann, Valentin and Oh, Sewoong and Smith, Noah A. and Choi, Yejin. SuperBPE: Space Travel for Language Models. 2025. arXiv:2503.13423
-
[37]
Tokenization Impacts Multilingual Language Modeling: Assessing Vocabulary Allocation and Overlap Across Languages
Limisiewicz, Tomasz and Balhar, Jiří and Mareček, David. Tokenization Impacts Multilingual Language Modeling: Assessing Vocabulary Allocation and Overlap Across Languages. Findings of the Association for Computational Linguistics: ACL 2023. 2023. doi:10.18653/v1/2023.findings-acl.350
-
[38]
MYTE: Morphology-Driven Byte Encoding for Better and Fairer Multilingual Language Modeling
Limisiewicz, Tomasz and Blevins, Terra and Gonen, Hila and Ahia, Orevaoghene and Zettlemoyer, Luke. MYTE: Morphology-Driven Byte Encoding for Better and Fairer Multilingual Language Modeling. Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2024. doi:10.18653/v1/2024.acl-long.804
-
[39]
MAGNET: Improving the Multilingual Fairness of Language Models with Adaptive Gradient-Based Tokenization
Ahia, Orevaoghene and Kumar, Sachin and Gonen, Hila and Hofmann, Valentin and Limisiewicz, Tomasz and Tsvetkov, Yulia and Smith, Noah A. MAGNET: Improving the Multilingual Fairness of Language Models with Adaptive Gradient-Based Tokenization. doi:10.52202/079017-1514
-
[40]
Flexitokens: Flexible tokenization for evolving language models
FLEXITOKENS: Flexible Tokenization for Evolving Language Models. arXiv preprint arXiv:2507.12720. 2025
-
[41]
Subword Regularization: Improving Neural Network Translation Models with Multiple Subword Candidates
Kudo, Taku. Subword Regularization: Improving Neural Network Translation Models with Multiple Subword Candidates. Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2018. doi:10.18653/v1/P18-1007
-
[42]
The Llama 3 Herd of Models
Llama Team, AI @ Meta. The Llama 3 Herd of Models. arXiv preprint arXiv:2407.21783. 2024
-
[43]
Qwen3 Technical Report
arXiv preprint. 2025
-
[44]
EuroLLM: Multilingual Language Models for Europe
Martins, Pedro Henrique and Fernandes, Patrick and Alves, João, et al. EuroLLM: Multilingual Language Models for Europe. doi:10.1016/j.procs.2025.02.260
-
[45]
Attention is All You Need
Vaswani, Ashish and Shazeer, Noam and Parmar, Niki and Uszkoreit, Jakob and Jones, Llion and Gomez, Aidan N. and Kaiser, Lukasz and Polosukhin, Illia. Attention is All You Need. Advances in Neural Information Processing Systems. 2017
-
[46]
Decoupled Weight Decay Regularization
Loshchilov, Ilya and Hutter, Frank. Decoupled Weight Decay Regularization. International Conference on Learning Representations. 2019
-
[47]
ByT5: Towards a Token-Free Future with Pre-trained Byte-to-Byte Models
Xue, Linting and Barua, Aditya and Constant, Noah and Al-Rfou, Rami and Narang, Sharan and Kale, Mihir and Roberts, Adam and Raffel, Colin. ByT5: Towards a Token-Free Future with Pre-trained Byte-to-Byte Models. Transactions of the Association for Computational Linguistics. 2022. doi:10.1162/tacl_a_00461
-
[48]
MambaByte: Token-free Selective State Space Model
arXiv preprint. 2024
-
[49]
An Image is Worth 32 Tokens for Reconstruction and Generation
The Thirty-eighth Annual Conference on Neural Information Processing Systems. 2024
-
[50]
Neural Discrete Representation Learning
van den Oord, Aaron and Vinyals, Oriol and Kavukcuoglu, Koray. Neural Discrete Representation Learning. Proceedings of the 31st International Conference on Neural Information Processing Systems. 2017
-
[51]
Scaling Laws for Fine-Grained Mixture of Experts
Proceedings of the 41st International Conference on Machine Learning. 2024
-
[52]
Do All Languages Cost the Same? Tokenization in the Era of Commercial Language Models
Ahia, Orevaoghene and Kumar, Sachin and Gonen, Hila and Kasai, Jungo and Mortensen, David and Smith, Noah and Tsvetkov, Yulia. Do All Languages Cost the Same? Tokenization in the Era of Commercial Language Models. Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing. 2023. doi:10.18653/v1/2023.emnlp-main.614
-
[53]
Language Model Tokenizers Introduce Unfairness Between Languages
Advances in Neural Information Processing Systems. 2023