Compute Optimal Tokenization
Pith reviewed 2026-05-09 15:23 UTC · model grok-4.3
The pith
In compute-optimal regimes, language model parameter counts scale with the byte volume of data rather than the number of tokens.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
In compute-optimal configurations, model parameter counts scale proportionally to data size measured in bytes, not in tokens as commonly perceived. The optimal compression rate differs from the one obtained with BPE and decreases with compute. These findings generalize to both latent and subword tokenization as well as to languages other than English.
What carries the argument
The compression rate (average bytes of text per token), which the authors vary continuously by training latent-tokenized BLT models that decouple token granularity from the language model itself.
Load-bearing premise
That the scaling patterns observed with controllable latent tokenization will hold for ordinary subword tokenizers at scales beyond the 7B-parameter experiments.
What would settle it
Train matched sets of models with a fixed BPE tokenizer across a range of sizes and data volumes, then check whether the parameter count that minimizes loss per byte continues to follow the reported linear relationship.
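The proposed check can be sketched in a few lines: given (data-in-bytes, loss-minimizing parameter count) pairs from such matched BPE runs, fit a power law N = a · bytes^b in log-log space and see whether the exponent b stays near 1. The data points below are synthetic placeholders, not values from the paper.

```python
import numpy as np

# Synthetic placeholder measurements: bytes of training data and the
# parameter count that minimized loss-per-byte at each data scale.
data_bytes = np.array([2e10, 8e10, 3.2e11, 1.28e12])
optimal_n = np.array([1e8, 4e8, 1.6e9, 6.4e9])

# Fit log N = b * log(bytes) + log(a); b near 1 would support N ∝ bytes.
b, log_a = np.polyfit(np.log(data_bytes), np.log(optimal_n), 1)
print(f"fitted exponent b = {b:.3f}")
```

The same fit run against token counts instead of byte counts would let the two proportionality claims be compared directly.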
read the original abstract
Scaling laws enable the optimal selection of data amount and language model size, yet the impact of the data unit, the token, on this relationship remains underexplored. In this work, we systematically investigate how the information granularity of tokens, controlled by the compression rate (i.e., average bytes of text per token), affects scaling trends. We train 988 latent tokenized models (BLT) ranging from 50M to 7B parameters that enable setting the desired compression rate. This flexibility allows us to study the role of compression rate well beyond 4.57 bytes per token obtained with a popular BPE tokenizer. Our experiments reveal that in compute-optimal configurations, model parameter counts scale proportionally to data size measured in bytes, not in tokens as commonly perceived (Kaplan et al., 2020; Hoffmann et al., 2022). Furthermore, we discover that the optimal compression rate differs from the one obtained with BPE and decreases with compute. These findings generalize to both latent and subword tokenization, as well as to languages other than English, guiding language model developers on tokenization scheme selection for maximal compute efficiency.
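For readers unfamiliar with the metric, the compression rate the abstract refers to is simply UTF-8 bytes of text divided by the number of tokens. A minimal sketch, using a whitespace split as a stand-in tokenizer (the 4.57 figure in the abstract comes from an actual BPE tokenizer, not this toy):

```python
def compression_rate(text: str, tokens: list[str]) -> float:
    """Average bytes of text carried by each token."""
    return len(text.encode("utf-8")) / len(tokens)

text = "Scaling laws enable optimal model sizing"
tokens = text.split()  # 6 whitespace "tokens"; a real system would use BPE or BLT
print(f"{compression_rate(text, tokens):.2f} bytes/token")  # → 6.67 bytes/token
```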
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper trains 988 BLT (latent tokenized) models from 50M to 7B parameters with controllable compression rates to study how token granularity affects scaling. It claims that compute-optimal model size scales linearly with data measured in bytes rather than tokens, that optimal compression rate decreases with compute budget, and that these relations generalize to standard subword tokenizers (e.g., BPE) and non-English languages.
Significance. If the byte-based scaling relation holds beyond the BLT testbed, the work would revise the token-centric assumptions in Kaplan et al. (2020) and Hoffmann et al. (2022), offering practical guidance on tokenizer selection. The scale of 988 models across a wide compression range is a clear empirical strength, enabling direct observation of compression effects that fixed-vocabulary pipelines cannot easily isolate.
major comments (3)
- [Abstract] Abstract: The central claim that 'these findings generalize to both latent and subword tokenization' is asserted without any reported model counts, parameter ranges, compression rates tested, or controls for the subword experiments. Because the headline revision to N ∝ bytes rests on this transfer from BLT (which learns compression inside the forward pass) to fixed BPE pipelines, the absence of these details makes the load-bearing generalization impossible to assess.
- [Methods / Experimental setup] Experimental setup (implied by the 988-model study): No information is given on the procedure used to identify compute-optimal configurations, the functional form fitted to the scaling data, baseline comparisons, or statistical tests for the reported trends. With results drawn from 988 runs, the lack of these controls prevents evaluation of whether the byte proportionality is robust or sensitive to post-hoc choices.
- [Results] Results on optimal compression: The claim that optimal compression rate 'decreases with compute' is presented as a discovery, yet no equation, table, or figure reference shows the fitted dependence or its uncertainty; without this, it is unclear whether the trend is independent of the BLT architecture or an artifact of the latent-token design.
minor comments (2)
- [Abstract] The abstract states generalization 'to languages other than English' but supplies no supporting counts or figures; a brief summary table or sentence in the main text would clarify the scope.
- [Introduction] Notation for 'compression rate' (bytes per token) is introduced without an early formal definition or relation to standard BPE metrics; adding this would aid readers unfamiliar with BLT.
Simulated Author's Rebuttal
We thank the referee for their thoughtful and constructive comments. We address each major comment below and will revise the manuscript to improve clarity and completeness where the concerns are valid.
read point-by-point responses
-
Referee: [Abstract] Abstract: The central claim that 'these findings generalize to both latent and subword tokenization' is asserted without any reported model counts, parameter ranges, compression rates tested, or controls for the subword experiments. Because the headline revision to N ∝ bytes rests on this transfer from BLT (which learns compression inside the forward pass) to fixed BPE pipelines, the absence of these details makes the load-bearing generalization impossible to assess.
Authors: We agree that the abstract would be strengthened by including specific details on the subword experiments. The main text reports additional runs with fixed BPE tokenizers to support the generalization claim, but these were not quantified in the abstract. In the revision we will update the abstract to state the number of subword models, their parameter ranges, and the compression rates tested, making the transfer from BLT to standard pipelines explicit and easier to evaluate. revision: yes
-
Referee: [Methods / Experimental setup] Experimental setup (implied by the 988-model study): No information is given on the procedure used to identify compute-optimal configurations, the functional form fitted to the scaling data, baseline comparisons, or statistical tests for the reported trends. With results drawn from 988 runs, the lack of these controls prevents evaluation of whether the byte proportionality is robust or sensitive to post-hoc choices.
Authors: The manuscript outlines the 988-model study and scaling analysis, but we acknowledge that the exact procedure for selecting compute-optimal points, the functional form of the fitted relations, baseline comparisons, and statistical tests are not described with sufficient precision. We will revise the methods section to add these details, including the specific scaling-law equation, how optimal configurations were identified from the runs, and any robustness checks performed. revision: yes
-
Referee: [Results] Results on optimal compression: The claim that optimal compression rate 'decreases with compute' is presented as a discovery, yet no equation, table, or figure reference shows the fitted dependence or its uncertainty; without this, it is unclear whether the trend is independent of the BLT architecture or an artifact of the latent-token design.
Authors: The decreasing trend in optimal compression rate with compute is illustrated via plots of optimal configurations across compute budgets in the results section. We agree that an explicit functional form and uncertainty quantification would make the claim more rigorous and help address whether it is architecture-specific. In the revision we will add the fitted dependence (including the equation and uncertainty) to the text, reference it from the relevant figure, and discuss its implications for the BLT design. revision: yes
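A sketch of the kind of fitting pipeline the rebuttal promises to document. The Chinchilla-style form L(N, D) = E + A/N^α + B/D^β and every number below are assumptions for illustration (the paper's reference list cites L-BFGS-B, which hints at a fit of roughly this shape), not the paper's actual procedure.

```python
import numpy as np
from scipy.optimize import minimize

# Synthetic "runs": model sizes N, data sizes D, and losses generated
# from an assumed Chinchilla-style surface L = E + A/N^a + B/D^b.
rng = np.random.default_rng(0)
N = rng.uniform(5e7, 7e9, 40)
D = rng.uniform(1e9, 1e12, 40)
loss = 1.7 + 400.0 / N**0.34 + 410.0 / D**0.28

def objective(p):
    # Parametrize A and B through logs to keep them positive.
    E, logA, alpha, logB, beta = p
    pred = E + np.exp(logA) / N**alpha + np.exp(logB) / D**beta
    return float(np.mean((pred - loss) ** 2))

x0 = [2.0, 5.0, 0.3, 5.0, 0.3]
res = minimize(objective, x0, method="L-BFGS-B")
print(f"fitted irreducible loss E = {res.x[0]:.3f}, residual = {res.fun:.2e}")
```

Compute-optimal configurations would then follow by minimizing the fitted L subject to a FLOP constraint (e.g. C ≈ 6ND), and robustness checks would refit under held-out runs or alternative initializations.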
Circularity Check
No significant circularity: empirical observations from model training
full rationale
The paper is a purely empirical study that trains 988 BLT models (50M–7B parameters) with controllable compression rates and directly observes scaling trends in compute-optimal regimes. The central claims (that optimal parameter count scales with bytes rather than tokens, and that optimal compression decreases with compute) are presented as experimental findings rather than derived via a mathematical chain, a fitted parameter renamed as a prediction, or a self-referential definition. No equations or derivations are shown that reduce the reported relations to their inputs by construction, and the cited prior work (Kaplan et al., Hoffmann et al.) is external rather than a load-bearing self-citation. The generalization to subword tokenizers is stated as an extension of the observed trends, not as a circular reduction. The claims therefore rest directly on the experimental data rather than on any internal derivation.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
-
[1]
Scaling Laws for Neural Language Models
Kaplan, Jared and McCandlish, Sam and Henighan, Tom and Brown, Tom B. and Chess, Benjamin and Child, Rewon and Gray, Scott and Radford, Alec and Wu, Jeffrey and Amodei, Dario. Scaling Laws for Neural Language Models. arXiv preprint arXiv:2001.08361. 2020
-
[2]
Training Compute-Optimal Large Language Models
Hoffmann, Jordan and Borgeaud, Sebastian and Mensch, Arthur and Buchatskaya, Elena and Cai, Trevor and Rutherford, Eliza and de Las Casas, Diego and Hendricks, Lisa Anne and Welbl, Johannes and Clark, Aidan and Hennigan, Tom and Noland, Eric and Millican, Katie and van den Driessche, George and Damoc, Bogdan and Guy, Aurelia and Osindero, Simon and Simony...
-
[3]
Resolving Discrepancies in Compute-Optimal Scaling of Language Models
Porian, Tomer and Wortsman, Mitchell and Jitsev, Jenia and Schmidt, Ludwig and Carmon, Yair. Resolving Discrepancies in Compute-Optimal Scaling of Language Models. doi:10.52202/079017-3189
-
[4]
Tim Pearce and Jinyeop Song. TMLR.
-
[5]
Scaling Laws with Vocabulary: Larger Models Deserve Larger Vocabularies
Tao, Chaofan and Liu, Qian and Dou, Longxu and Muennighoff, Niklas and Wan, Zhongwei and Luo, Ping and Lin, Min and Wong, Ngai. Scaling Laws with Vocabulary: Larger Models Deserve Larger Vocabularies. The Thirty-eighth Annual Conference on Neural Information Processing Systems. 2024
-
[6]
(Mis)Fitting Scaling Laws: A Survey of Scaling Law Fitting Techniques in Deep Learning
The Thirteenth International Conference on Learning Representations.
-
[7]
Scaling Laws for Code: Every Programming Language Matters
arXiv preprint. 2025
-
[8]
ATLAS: Adaptive Transfer Scaling Laws for Multilingual Pretraining, Finetuning, and Decoding the Curse of Multilinguality
arXiv preprint. 2026
-
[9]
Scaling Laws for Multilingual Language Models
He, Yifei and Benhaim, Alon and Patra, Barun and Vaddamanu, Praneetha and Ahuja, Sanchit and Chopra, Parul and Chaudhary, Vishrav and Zhao, Han and Song, Xia. Scaling Laws for Multilingual Language Models. Findings of the Association for Computational Linguistics: ACL 2025. 2025. doi:10.18653/v1/2025.findings-acl.221
-
[10]
Scaling Laws for Generative Mixed-Modal Language Models
Proceedings of the 40th International Conference on Machine Learning. 2023
-
[11]
DataDecide: How to Predict Best Pretraining Data with Small Experiments
arXiv preprint. 2025
-
[12]
On the Limited Memory BFGS Method for Large Scale Optimization
Liu, Dong C. and Nocedal, Jorge. On the Limited Memory BFGS Method for Large Scale Optimization. Math. Program. 1989
-
[13]
Algorithm 778: L-BFGS-B: Fortran subroutines for large-scale bound-constrained optimization
Zhu, Ciyou and Byrd, Richard H and Lu, Peihuang and Nocedal, Jorge. Algorithm 778: L-BFGS-B: Fortran subroutines for large-scale bound-constrained optimization. ACM Transactions on Mathematical Software (TOMS). 1997
-
[14]
DataComp-LM: In search of the next generation of training sets for language models
Li, Jeffrey and Fang, Alex and Smez, Georgios and Albalak, Alon and Mehta, Kaber and Openshaw, Etash and Haber, Louis and Wortsman, Mitchell and Keh, Sedrick and Gadre, Samir Yitzhak and Taori, Rohan and Tian, Shuran and Jitsev, Jenia and Ilharco, Gabriel and Smola, Alexander and Farhadi, Ali and Shankar, Vaishaal and Schmidt, Ludwig and Carmon, Yair and ...
-
[15]
FineWeb2: One Pipeline to Scale Them All -- Adapting Pre-Training Data Processing to Every Language
arXiv preprint. 2025
-
[16]
No Language Left Behind: Scaling Human-Centered Machine Translation
arXiv preprint. 2022
-
[17]
The Flores-101 Evaluation Benchmark for Low-Resource and Multilingual Machine Translation
Transactions of the Association for Computational Linguistics. 2022
-
[18]
Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer
Raffel, Colin and Shazeer, Noam and Roberts, Adam and Lee, Katherine and Narang, Sharan and Matena, Michael and Zhou, Yanqi and Li, Wei and Liu, Peter J. Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer. Journal of Machine Learning Research. 2020
-
[19]
The Stack: 3 TB of permissively licensed source code (2022)
Kocetkov, Denis and Li, Raymond and Allal, Loubna Ben and Li, Jia and Mou, Chenghao and Muñoz Ferrandis, Carlos and Jernite, Yacine and Mitchell, Margaret and Hughes, Sean and Wolf, Thomas and Baez, Dzmitry and Dao, Gravity and Mishra, Mayank and Gu, Alex and Dey, Brendan and Luccioni, Sasha and Vero, Stella Biderman and Muller, Benjamin and de Vries, Har...
-
[20]
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics , year=
HellaSwag: Can a Machine Really Finish Your Sentence? , author=. Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics , year=
-
[21]
Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge
Peter Clark and Isaac Cowhey and Oren Etzioni and Tushar Khot and Ashish Sabharwal and Carissa Schoenick and Oyvind Tafjord. Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge. arXiv:1803.05457. 2018
-
[22]
Byte Latent Transformer: Patches Scale Better Than Tokens
Pagnoni, Artidoro and Pasunuru, Ramakanth and Rodriguez, Pedro and Nguyen, John and Muller, Benjamin and Li, Margaret and Zhou, Chunting and Yu, Lili and Weston, Jason E and Zettlemoyer, Luke and Ghosh, Gargi and Lewis, Mike and Holtzman, Ari and Iyer, Srini. Byte Latent Transformer: Patches Scale Better Than Tokens. Proceedings of the 63rd Annual Meeting...
-
[23]
Dynamic Chunking for End-to-End Hierarchical Sequence Modeling
arXiv preprint. 2025
-
[24]
Hierarchical Autoregressive Transformers: Combining Byte- and Word-Level Processing for Robust, Adaptable Language Models
The Thirteenth International Conference on Learning Representations. 2025
-
[25]
Efficient Transformers with Dynamic Token Pooling
Nawrot, Piotr and Chorowski, Jan and Lancucki, Adrian and Ponti, Edoardo Maria. Efficient Transformers with Dynamic Token Pooling. Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2023. doi:10.18653/v1/2023.acl-long.353
-
[26]
SpaceByte: Towards Deleting Tokenization from Large Language Modeling
The Thirty-eighth Annual Conference on Neural Information Processing Systems. 2024
-
[27]
BPE-Dropout: Simple and Effective Subword Regularization
Provilkov, Ivan and Emelianenko, Dmitrii and Voita, Elena. BPE-Dropout: Simple and Effective Subword Regularization. Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. 2020. doi:10.18653/v1/2020.acl-main.170
-
[28]
How Good is Your Tokenizer? On the Monolingual Performance of Multilingual Language Models
Rust, Phillip and Pfeiffer, Jonas and Vulić, Ivan and Ruder, Sebastian and Gurevych, Iryna. How Good is Your Tokenizer? On the Monolingual Performance of Multilingual Language Models. Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volum...
-
[29]
Unpacking Tokenization: Evaluating Text Compression and its Correlation with Model Performance
Goldman, Omer and Caciularu, Avi and Eyal, Matan and Cao, Kris and Szpektor, Idan and Tsarfaty, Reut. Unpacking Tokenization: Evaluating Text Compression and its Correlation with Model Performance. Findings of the Association for Computational Linguistics: ACL 2024. 2024. doi:10.18653/v1/2024.findings-acl.134
-
[30]
Investigating the Effectiveness of BPE: The Power of Shorter Sequences
Gallé, Matthias. Investigating the Effectiveness of BPE: The Power of Shorter Sequences. Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). 2019. doi:10.18653/v1/D19-1141
-
[31]
Explaining and Mitigating Crosslingual Tokenizer Inequities
arXiv preprint.
-
[32]
Tokenization Is More Than Compression
Schmidt, Craig W and Reddy, Varshini and Zhang, Haoran and Alameddine, Alec and Uzan, Omri and Pinter, Yuval and Tanner, Chris. Tokenization Is More Than Compression. Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing. 2024. doi:10.18653/v1/2024.emnlp-main.40
-
[33]
Tokenizer Choice For LLM Training: Negligible or Crucial?
Ali, Mehdi and Fromm, Michael and Thellmann, Klaudia and Rutmann, Richard, et al. Tokenizer Choice For LLM Training: Negligible or Crucial?. Findings of the Association for Computational Linguistics: NAACL 2024. 2024. doi:10.18653/v1/2024.findings-naacl.247
-
[34]
From bytes to ideas: Language modeling with autoregressive u-nets
Mathurin Videau and Badr Youbi Idrissi and Alessandro Leite and Marc Schoenauer and Olivier Teytaud and David Lopez-Paz. From bytes to ideas: Language modeling with autoregressive u-nets. arXiv preprint arXiv:2506.14761. 2025
-
[35]
Neural Machine Translation of Rare Words with Subword Units
Sennrich, Rico and Haddow, Barry and Birch, Alexandra. Neural Machine Translation of Rare Words with Subword Units. Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2016. doi:10.18653/v1/P16-1162
-
[36]
SuperBPE: Space Travel for Language Models
Liu, Alisa and Hayase, Jonathan and Hofmann, Valentin and Oh, Sewoong and Smith, Noah A. and Choi, Yejin. SuperBPE: Space Travel for Language Models. 2025. arXiv:2503.13423
-
[37]
Tokenization Impacts Multilingual Language Modeling: Assessing Vocabulary Allocation and Overlap Across Languages
Limisiewicz, Tomasz and Balhar, Jiří and Mareček, David. Tokenization Impacts Multilingual Language Modeling: Assessing Vocabulary Allocation and Overlap Across Languages. Findings of the Association for Computational Linguistics: ACL 2023. 2023. doi:10.18653/v1/2023.findings-acl.350
-
[38]
MYTE: Morphology-Driven Byte Encoding for Better and Fairer Multilingual Language Modeling
Limisiewicz, Tomasz and Blevins, Terra and Gonen, Hila and Ahia, Orevaoghene and Zettlemoyer, Luke. MYTE: Morphology-Driven Byte Encoding for Better and Fairer Multilingual Language Modeling. Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2024. doi:10.18653/v1/2024.acl-long.804
-
[39]
MAGNET: Improving the Multilingual Fairness of Language Models with Adaptive Gradient-Based Tokenization
Ahia, Orevaoghene and Kumar, Sachin and Gonen, Hila and Hofmann, Valentin and Limisiewicz, Tomasz and Tsvetkov, Yulia and Smith, Noah A. MAGNET: Improving the Multilingual Fairness of Language Models with Adaptive Gradient-Based Tokenization. doi:10.52202/079017-1514
-
[40]
Flexitokens: Flexible tokenization for evolving language models
FLEXITOKENS: Flexible Tokenization for Evolving Language Models. arXiv preprint arXiv:2507.12720. 2025
-
[41]
Subword Regularization: Improving Neural Network Translation Models with Multiple Subword Candidates
Kudo, Taku. Subword Regularization: Improving Neural Network Translation Models with Multiple Subword Candidates. Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2018. doi:10.18653/v1/P18-1007
-
[42]
The Llama 3 Herd of Models
Llama Team, AI @ Meta. The Llama 3 Herd of Models. arXiv preprint arXiv:2407.21783. 2024
-
[43]
Qwen3 Technical Report
arXiv preprint. 2025
-
[44]
EuroLLM: Multilingual Language Models for Europe
Martins, Pedro Henrique and Fernandes, Patrick and Alves, João, et al. EuroLLM: Multilingual Language Models for Europe. doi:10.1016/j.procs.2025.02.260
-
[45]
Attention is All You Need
Vaswani, Ashish and Shazeer, Noam and Parmar, Niki and Uszkoreit, Jakob and Jones, Llion and Gomez, Aidan N. and Kaiser, Lukasz and Polosukhin, Illia. Attention is All You Need. Advances in Neural Information Processing Systems. 2017
-
[46]
Decoupled Weight Decay Regularization
Loshchilov, Ilya and Hutter, Frank. Decoupled Weight Decay Regularization. International Conference on Learning Representations. 2019
-
[47]
ByT5: Towards a Token-Free Future with Pre-trained Byte-to-Byte Models
Xue, Linting and Barua, Aditya and Constant, Noah and Al-Rfou, Rami and Narang, Sharan and Kale, Mihir and Roberts, Adam and Raffel, Colin. ByT5: Towards a Token-Free Future with Pre-trained Byte-to-Byte Models. Transactions of the Association for Computational Linguistics. 2022. doi:10.1162/tacl_a_00461
-
[48]
MambaByte: Token-free Selective State Space Model
arXiv preprint. 2024
-
[49]
An Image is Worth 32 Tokens for Reconstruction and Generation
The Thirty-eighth Annual Conference on Neural Information Processing Systems. 2024
-
[50]
Neural Discrete Representation Learning
van den Oord, Aaron and Vinyals, Oriol and Kavukcuoglu, Koray. Neural Discrete Representation Learning. Proceedings of the 31st International Conference on Neural Information Processing Systems. 2017
-
[51]
Scaling Laws for Fine-Grained Mixture of Experts
Proceedings of the 41st International Conference on Machine Learning. 2024
-
[52]
Do All Languages Cost the Same? Tokenization in the Era of Commercial Language Models
Ahia, Orevaoghene and Kumar, Sachin and Gonen, Hila and Kasai, Jungo and Mortensen, David and Smith, Noah and Tsvetkov, Yulia. Do All Languages Cost the Same? Tokenization in the Era of Commercial Language Models. Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing. 2023. doi:10.18653/v1/2023.emnlp-main.614
-
[53]
Language Model Tokenizers Introduce Unfairness Between Languages
Advances in Neural Information Processing Systems. 2023