pith. machine review for the scientific record.

arxiv: 2605.01188 · v1 · submitted 2026-05-02 · 💻 cs.CL

Recognition: unknown

Compute Optimal Tokenization

Authors on Pith: no claims yet

Pith reviewed 2026-05-09 15:23 UTC · model grok-4.3

classification 💻 cs.CL
keywords scaling laws · tokenization · compute optimal · language models · compression rate · data efficiency · BPE

The pith

In compute-optimal regimes, language model parameter counts scale with the byte volume of data rather than the number of tokens.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The work tests how the average number of bytes of text per token, the compression rate, shapes the relationships among model size, data volume, and total compute. By training nearly a thousand models with adjustable compression rates, the authors isolate the effect of token granularity on scaling behavior. They find that the optimal number of parameters grows linearly with the bytes of training data, overturning the token-based scaling rule that has guided recent model design. The optimal compression rate itself falls as compute budgets rise, so larger training runs favor finer tokens. These patterns appear across both the experimental tokenizer and standard subword methods, and they hold for languages beyond English.
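
To make the byte-versus-token distinction concrete, here is an illustrative identity rather than anything fitted in the paper: a tokenizer with compression rate r bytes per token relates the two measures of the same corpus by

    D_bytes = r · D_tokens

so the token-based prescription N_opt ∝ D_tokens shifts with the choice of tokenizer, while the byte-based prescription N_opt ∝ D_bytes ties the optimal model size to the underlying text, however finely it is cut into tokens.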

Core claim

In compute-optimal configurations, model parameter counts scale proportionally to data size measured in bytes, not in tokens as commonly perceived. The optimal compression rate differs from the one obtained with BPE and decreases with compute. These findings generalize to both latent and subword tokenization as well as to languages other than English.

What carries the argument

The compression rate (average bytes of text per token), varied continuously by training latent-tokenized BLT models that decouple token granularity from the language model itself.
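
As a minimal sketch of the quantity at stake, the snippet below computes a compression rate as UTF-8 bytes divided by token count. The whitespace splitter is a stand-in for whichever tokenizer (BPE, BLT patching) is under study, and the function name and sample string are hypothetical, not taken from the paper.

    # Minimal sketch: compression rate = UTF-8 bytes of text / number of tokens.
    # The whitespace splitter passed in below is a stand-in tokenizer, not the paper's BLT patcher.
    def compression_rate(text: str, tokenize) -> float:
        tokens = tokenize(text)
        return len(text.encode("utf-8")) / max(len(tokens), 1)

    sample = "Scaling laws relate model size, data volume, and compute."
    print(f"{compression_rate(sample, str.split):.2f} bytes per token")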

Load-bearing premise

That the scaling patterns observed with controllable latent tokenization will hold for ordinary subword tokenizers at scales beyond the 7B-parameter experiments.

What would settle it

Train matched sets of models with a fixed BPE tokenizer across a range of sizes and data volumes, then check whether the parameter count that minimizes loss per byte continues to follow the reported linear relationship.
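
One way to run that check, sketched here on made-up numbers: for each compute budget, fit a parabola to loss per byte as a function of log model size, take its minimum, and test whether log N_opt grows against log D_bytes with slope near one. The run grid, loss values, assumed bytes-per-token figure, and the C ≈ 6·N·D_tokens approximation are all illustrative assumptions, not values from the paper.

    # Hedged sketch of the proposed check; every number below is hypothetical.
    import numpy as np

    BYTES_PER_TOKEN = 4.5  # assumed fixed BPE compression rate for the illustration
    runs = {               # compute budget (FLOPs) -> [(N params, loss per byte), ...]
        1e19: [(1.0e8, 0.92), (2.0e8, 0.88), (4.0e8, 0.90)],
        1e20: [(3.2e8, 0.80), (6.3e8, 0.76), (1.26e9, 0.78)],
        1e21: [(1.0e9, 0.70), (2.0e9, 0.66), (4.0e9, 0.68)],
    }

    log_n_opt, log_d_opt = [], []
    for budget, points in sorted(runs.items()):
        log_n = np.log([n for n, _ in points])
        loss = np.array([l for _, l in points])
        a, b, _ = np.polyfit(log_n, loss, 2)   # parabola in log N on this iso-FLOP slice
        n_star = -b / (2 * a)                  # log N at the fitted loss minimum
        # bytes seen at that optimum, via C ≈ 6 * N * D_tokens and a fixed bytes/token
        d_star = np.log(BYTES_PER_TOKEN * budget / 6.0) - n_star
        log_n_opt.append(n_star)
        log_d_opt.append(d_star)

    slope, _ = np.polyfit(log_d_opt, log_n_opt, 1)
    print(f"slope of log N_opt vs log D_bytes: {slope:.2f} (≈ 1 would support linear scaling)")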

read the original abstract

Scaling laws enable the optimal selection of data amount and language model size, yet the impact of the data unit, the token, on this relationship remains underexplored. In this work, we systematically investigate how the information granularity of tokens, controlled by the compression rate (i.e., average bytes of text per token), affects scaling trends. We train 988 latent tokenized models (BLT) ranging from 50M to 7B parameters that enable setting the desired compression rate. This flexibility allows us to study the role of compression rate well beyond 4.57 bytes per token obtained with a popular BPE tokenizer. Our experiments reveal that in compute-optimal configurations, model parameter counts scale proportionally to data size measured in bytes, not in tokens as commonly perceived (Kaplan et al., 2020; Hoffmann et al., 2022). Furthermore, we discover that the optimal compression rate differs from the one obtained with BPE and decreases with compute. These findings generalize to both latent and subword tokenization, as well as to languages other than English, guiding language model developers on tokenization scheme selection for maximal compute efficiency.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper trains 988 BLT (latent tokenized) models from 50M to 7B parameters with controllable compression rates to study how token granularity affects scaling. It claims that compute-optimal model size scales linearly with data measured in bytes rather than tokens, that optimal compression rate decreases with compute budget, and that these relations generalize to standard subword tokenizers (e.g., BPE) and non-English languages.

Significance. If the byte-based scaling relation holds beyond the BLT testbed, the work would revise the token-centric assumptions in Kaplan et al. (2020) and Hoffmann et al. (2022), offering practical guidance on tokenizer selection. The scale of 988 models across a wide compression range is a clear empirical strength, enabling direct observation of compression effects that fixed-vocabulary pipelines cannot easily isolate.

major comments (3)
  1. [Abstract] Abstract: The central claim that 'these findings generalize to both latent and subword tokenization' is asserted without any reported model counts, parameter ranges, compression rates tested, or controls for the subword experiments. Because the headline revision to N ∝ bytes rests on this transfer from BLT (which learns compression inside the forward pass) to fixed BPE pipelines, the absence of these details makes the load-bearing generalization impossible to assess.
  2. [Methods / Experimental setup] Experimental setup (implied by the 988-model study): No information is given on the procedure used to identify compute-optimal configurations, the functional form fitted to the scaling data, baseline comparisons, or statistical tests for the reported trends. With results drawn from 988 runs, the lack of these controls prevents evaluation of whether the byte proportionality is robust or sensitive to post-hoc choices. (A sketch of one standard form such a procedure could take appears after these comments.)
  3. [Results] Results on optimal compression: The claim that optimal compression rate 'decreases with compute' is presented as a discovery, yet no equation, table, or figure reference shows the fitted dependence or its uncertainty; without this, it is unclear whether the trend is independent of the BLT architecture or an artifact of the latent-token design.
minor comments (2)
  1. [Abstract] The abstract states generalization 'to languages other than English' but supplies no supporting counts or figures; a brief summary table or sentence in the main text would clarify the scope.
  2. [Introduction] Notation for 'compression rate' (bytes per token) is introduced without an early formal definition or relation to standard BPE metrics; adding this would aid readers unfamiliar with BLT.
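
To make major comment 2 concrete, the sketch below shows one standard form such a fitting procedure can take, in the style of Hoffmann et al. with data measured in bytes: fit a parametric loss surface by minimizing a Huber loss on log-space residuals with L-BFGS-B. The functional form, Huber threshold, initial values, and run logs are all assumptions for illustration; the paper's actual choices are exactly what the comment says is missing.

    # Hedged sketch of one standard scaling-law fit (Hoffmann-style), shown only to
    # illustrate the kind of detail major comment 2 asks for; none of the numbers or
    # settings below are reported by the paper.
    import numpy as np
    from scipy.optimize import minimize

    # Hypothetical run logs: model size N (params), data D (bytes), observed loss per byte.
    N = np.array([1.0e8, 2.0e8, 4.0e8, 8.0e8, 1.6e9])
    D = np.array([2.0e10, 4.0e10, 8.0e10, 1.6e11, 3.2e11])
    L = np.array([0.95, 0.88, 0.82, 0.77, 0.73])

    def predicted_loss(params, N, D):
        # L_hat(N, D) = E + A / N**alpha + B / D**beta, with D counted in bytes
        log_E, log_A, log_B, alpha, beta = params
        return np.exp(log_E) + np.exp(log_A) / N**alpha + np.exp(log_B) / D**beta

    def objective(params, delta=1e-3):
        # Huber loss on log-space residuals, a common choice for robust scaling-law fits
        resid = np.log(predicted_loss(params, N, D)) - np.log(L)
        clipped = np.minimum(np.abs(resid), delta)
        return np.sum(0.5 * clipped**2 + delta * (np.abs(resid) - clipped))

    init = np.array([np.log(0.5), np.log(1e2), np.log(1e4), 0.3, 0.3])
    fit = minimize(objective, init, method="L-BFGS-B")
    print("fitted (log E, log A, log B, alpha, beta):", np.round(fit.x, 3))

From such a fit, the compute-optimal model size at a budget C follows by minimizing the predicted loss subject to C ≈ 6·N·D_tokens, which is exactly where the byte-versus-token question enters.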

Simulated Authors' Rebuttal

3 responses · 0 unresolved

We thank the referee for their thoughtful and constructive comments. We address each major comment below and will revise the manuscript to improve clarity and completeness where the concerns are valid.

read point-by-point responses
  1. Referee: [Abstract] Abstract: The central claim that 'these findings generalize to both latent and subword tokenization' is asserted without any reported model counts, parameter ranges, compression rates tested, or controls for the subword experiments. Because the headline revision to N ∝ bytes rests on this transfer from BLT (which learns compression inside the forward pass) to fixed BPE pipelines, the absence of these details makes the load-bearing generalization impossible to assess.

    Authors: We agree that the abstract would be strengthened by including specific details on the subword experiments. The main text reports additional runs with fixed BPE tokenizers to support the generalization claim, but these were not quantified in the abstract. In the revision we will update the abstract to state the number of subword models, their parameter ranges, and the compression rates tested, making the transfer from BLT to standard pipelines explicit and easier to evaluate. revision: yes

  2. Referee: [Methods / Experimental setup] Experimental setup (implied by the 988-model study): No information is given on the procedure used to identify compute-optimal configurations, the functional form fitted to the scaling data, baseline comparisons, or statistical tests for the reported trends. With results drawn from 988 runs, the lack of these controls prevents evaluation of whether the byte proportionality is robust or sensitive to post-hoc choices.

    Authors: The manuscript outlines the 988-model study and scaling analysis, but we acknowledge that the exact procedure for selecting compute-optimal points, the functional form of the fitted relations, baseline comparisons, and statistical tests are not described with sufficient precision. We will revise the methods section to add these details, including the specific scaling-law equation, how optimal configurations were identified from the runs, and any robustness checks performed. revision: yes

  3. Referee: [Results] Results on optimal compression: The claim that optimal compression rate 'decreases with compute' is presented as a discovery, yet no equation, table, or figure reference shows the fitted dependence or its uncertainty; without this, it is unclear whether the trend is independent of the BLT architecture or an artifact of the latent-token design.

    Authors: The decreasing trend in optimal compression rate with compute is illustrated via plots of optimal configurations across compute budgets in the results section. We agree that an explicit functional form and uncertainty quantification would make the claim more rigorous and help address whether it is architecture-specific. In the revision we will add the fitted dependence (including the equation and uncertainty) to the text, reference it from the relevant figure, and discuss its implications for the BLT design. revision: yes

Circularity Check

0 steps flagged

No significant circularity: empirical observations from model training

full rationale

The paper is a purely empirical study that trains 988 BLT models (50M–7B parameters) with controllable compression rates and directly observes scaling trends in compute-optimal regimes. The central claims—that optimal parameter count scales with bytes rather than tokens, and that optimal compression decreases with compute—are presented as experimental findings rather than derived via any mathematical chain, fitted parameter renamed as prediction, or self-referential definition. No equations or derivations are shown that reduce the reported relations to their inputs by construction, and the cited prior work (Kaplan et al., Hoffmann et al.) is external rather than a load-bearing self-citation. The generalization to subword tokenizers is stated as an extension of the observed trends but supplies no circular reduction. The reported relations therefore rest directly on the experimental data rather than on a self-referential derivation.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Central claim rests on empirical outcomes from training 988 BLT models across compression rates; no explicit free parameters, axioms, or invented entities are detailed in the abstract.

pith-pipeline@v0.9.0 · 5519 in / 957 out tokens · 48882 ms · 2026-05-09T15:23:12.731110+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

53 extracted references · 27 canonical work pages · 5 internal anchors

  1. [1]

    Scaling Laws for Neural Language Models

    Kaplan, Jared and McCandlish, Sam and Henighan, Tom and Brown, Tom B. and Chess, Benjamin and Child, Rewon and Gray, Scott and Radford, Alec and Wu, Jeffrey and Amodei, Dario. Scaling Laws for Neural Language Models. arXiv preprint arXiv:2001.08361. 2020

  2. [2]

    Training Compute-Optimal Large Language Models

    Hoffmann, Jordan and Borgeaud, Sebastian and Mensch, Arthur and Buchatskaya, Elena and Cai, Trevor and Rutherford, Eliza and de Las Casas, Diego and Hendricks, Lisa Anne and Welbl, Johannes and Clark, Aidan and Hennigan, Tom and Noland, Eric and Millican, Katie and van den Driessche, George and Damoc, Bogdan and Guy, Aurelia and Osindero, Simon and Simony...

  3. [3]

    Resolving Discrepancies in Compute-Optimal Scaling of Language Models

    Porian, Tomer and Wortsman, Mitchell and Jitsev, Jenia and Schmidt, Ludwig and Carmon, Yair. Resolving Discrepancies in Compute-Optimal Scaling of Language Models. doi:10.52202/079017-3189

  4. [4]

    Reconciling Kaplan and Chinchilla Scaling Laws

    Pearce, Tim and Song, Jinyeop. Reconciling Kaplan and Chinchilla Scaling Laws. TMLR.

  5. [5]

    Scaling Laws with Vocabulary: Larger Models Deserve Larger Vocabularies

    Tao, Chaofan and Liu, Qian and Dou, Longxu and Muennighoff, Niklas and Wan, Zhongwei and Luo, Ping and Lin, Min and Wong, Ngai. Scaling Laws with Vocabulary: Larger Models Deserve Larger Vocabularies. The Thirty-eighth Annual Conference on Neural Information Processing Systems. 2024

  6. [6]

    (Mis)Fitting Scaling Laws: A Survey of Scaling Law Fitting Techniques in Deep Learning

    (Mis)Fitting Scaling Laws: A Survey of Scaling Law Fitting Techniques in Deep Learning. The Thirteenth International Conference on Learning Representations.

  7. [7]

    Scaling Laws for Code: Every Programming Language Matters

    Scaling Laws for Code: Every Programming Language Matters. 2025.

  8. [8]

    ATLAS: Adaptive Transfer Scaling Laws for Multilingual Pretraining, Finetuning, and Decoding the Curse of Multilinguality

    ATLAS: Adaptive Transfer Scaling Laws for Multilingual Pretraining, Finetuning, and Decoding the Curse of Multilinguality. 2026.

  9. [9]

    Scaling Laws for Multilingual Language Models

    He, Yifei and Benhaim, Alon and Patra, Barun and Vaddamanu, Praneetha and Ahuja, Sanchit and Chopra, Parul and Chaudhary, Vishrav and Zhao, Han and Song, Xia. Scaling Laws for Multilingual Language Models. Findings of the Association for Computational Linguistics: ACL 2025. 2025. doi:10.18653/v1/2025.findings-acl.221

  10. [10]

    Scaling Laws for Generative Mixed-Modal Language Models

    Scaling Laws for Generative Mixed-Modal Language Models. Proceedings of the 40th International Conference on Machine Learning. 2023.

  11. [11]

    DataDecide: How to Predict Best Pretraining Data with Small Experiments

    DataDecide: How to Predict Best Pretraining Data with Small Experiments. 2025.

  12. [12]

    On the Limited Memory BFGS Method for Large Scale Optimization

    Liu, Dong C. and Nocedal, Jorge. On the Limited Memory BFGS Method for Large Scale Optimization. Mathematical Programming. 1989.

  13. [13]

    Algorithm 778: L-BFGS-B: Fortran subroutines for large-scale bound-constrained optimization

    Zhu, Ciyou and Byrd, Richard H and Lu, Peihuang and Nocedal, Jorge. Algorithm 778: L-BFGS-B: Fortran subroutines for large-scale bound-constrained optimization. ACM Transactions on Mathematical Software (TOMS). 1997

  14. [14]

    DataComp-LM: In search of the next generation of training sets for language models

    Li, Jeffrey and Fang, Alex and Smez, Georgios and Albalak, Alon and Mehta, Kaber and Openshaw, Etash and Haber, Louis and Wortsman, Mitchell and Keh, Sedrick and Gadre, Samir Yitzhak and Taori, Rohan and Tian, Shuran and Jitsev, Jenia and Ilharco, Gabriel and Smola, Alexander and Farhadi, Ali and Shankar, Vaishaal and Schmidt, Ludwig and Carmon, Yair and ...

  15. [15]

    FineWeb2: One Pipeline to Scale Them All -- Adapting Pre-Training Data Processing to Every Language

    FineWeb2: One Pipeline to Scale Them All -- Adapting Pre-Training Data Processing to Every Language. 2025.

  16. [16]

    No Language Left Behind: Scaling Human-Centered Machine Translation

    No Language Left Behind: Scaling Human-Centered Machine Translation. arXiv preprint.

  17. [17]

    The Flores-101 Evaluation Benchmark for Low-Resource and Multilingual Machine Translation

    The Flores-101 Evaluation Benchmark for Low-Resource and Multilingual Machine Translation. Transactions of the Association for Computational Linguistics.

  18. [18]

    Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer

    Raffel, Colin and Shazeer, Noam and Roberts, Adam and Lee, Katherine and Narang, Sharan and Matena, Michael and Zhou, Yanqi and Li, Wei and Liu, Peter J. Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer. Journal of Machine Learning Research. 2020

  19. [19]

    The Stack: 3 TB of permissively licensed source code. 2022

    Kocetkov, Denis and Li, Raymond and Allal, Loubna Ben and Li, Jia and Mou, Chenghao and Muñoz Ferrandis, Carlos and Jernite, Yacine and Mitchell, Margaret and Hughes, Sean and Wolf, Thomas and Baez, Dzmitry and Dao, Gravity and Mishra, Mayank and Gu, Alex and Dey, Brendan and Luccioni, Sasha and Vero, Stella Biderman and Muller, Benjamin and de Vries, Har...

  20. [20]

    HellaSwag: Can a Machine Really Finish Your Sentence?

    HellaSwag: Can a Machine Really Finish Your Sentence? Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics.

  21. [21]

    Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge

    Peter Clark and Isaac Cowhey and Oren Etzioni and Tushar Khot and Ashish Sabharwal and Carissa Schoenick and Oyvind Tafjord. Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge. arXiv:1803.05457v1.

  22. [22]

    Byte Latent Transformer: Patches Scale Better Than Tokens

    Pagnoni, Artidoro and Pasunuru, Ramakanth and Rodriguez, Pedro and Nguyen, John and Muller, Benjamin and Li, Margaret and Zhou, Chunting and Yu, Lili and Weston, Jason E and Zettlemoyer, Luke and Ghosh, Gargi and Lewis, Mike and Holtzman, Ari and Iyer, Srini. Byte Latent Transformer: Patches Scale Better Than Tokens. Proceedings of the 63rd Annual Meeting...

  23. [23]

    Dynamic Chunking for End-to-End Hierarchical Sequence Modeling

    Dynamic Chunking for End-to-End Hierarchical Sequence Modeling. 2025.

  24. [24]

    Hierarchical Autoregressive Transformers: Combining Byte- and Word-Level Processing for Robust, Adaptable Language Models

    Hierarchical Autoregressive Transformers: Combining Byte- and Word-Level Processing for Robust, Adaptable Language Models. The Thirteenth International Conference on Learning Representations.

  25. [25]

    Efficient Transformers with Dynamic Token Pooling

    Nawrot, Piotr and Chorowski, Jan and Lancucki, Adrian and Ponti, Edoardo Maria. Efficient Transformers with Dynamic Token Pooling. Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2023. doi:10.18653/v1/2023.acl-long.353

  26. [26]

    SpaceByte: Towards Deleting Tokenization from Large Language Modeling

    SpaceByte: Towards Deleting Tokenization from Large Language Modeling. The Thirty-eighth Annual Conference on Neural Information Processing Systems.

  27. [27]

    BPE-Dropout: Simple and Effective Subword Regularization

    Provilkov, Ivan and Emelianenko, Dmitrii and Voita, Elena. BPE-Dropout: Simple and Effective Subword Regularization. Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. 2020. doi:10.18653/v1/2020.acl-main.170

  28. [28]

    How Good is Your Tokenizer? On the Monolingual Performance of Multilingual Language Models

    Rust, Phillip and Pfeiffer, Jonas and Vulić, Ivan and Ruder, Sebastian and Gurevych, Iryna. How Good is Your Tokenizer? On the Monolingual Performance of Multilingual Language Models. Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volum...

  29. [29]

    Unpacking Tokenization: Evaluating Text Compression and its Correlation with Model Performance

    Goldman, Omer and Caciularu, Avi and Eyal, Matan and Cao, Kris and Szpektor, Idan and Tsarfaty, Reut. Unpacking Tokenization: Evaluating Text Compression and its Correlation with Model Performance. Findings of the Association for Computational Linguistics: ACL 2024. 2024. doi:10.18653/v1/2024.findings-acl.134

  30. [30]

    Investigating the Effectiveness of BPE: The Power of Shorter Sequences

    Gallé, Matthias. Investigating the Effectiveness of BPE: The Power of Shorter Sequences. Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). 2019. doi:10.18653/v1/D19-1141

  31. [31]

    Explaining and Mitigating Crosslingual Tokenizer Inequities

    Explaining and Mitigating Crosslingual Tokenizer Inequities. arXiv preprint.

  32. [32]

    Tokenization Is More Than Compression

    Schmidt, Craig W and Reddy, Varshini and Zhang, Haoran and Alameddine, Alec and Uzan, Omri and Pinter, Yuval and Tanner, Chris. Tokenization Is More Than Compression. Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing. 2024. doi:10.18653/v1/2024.emnlp-main.40

  33. [33]

    Tokenizer Choice For LLM Training: Negligible or Crucial?

    Ali, Mehdi and Fromm, Michael and Thellmann, Klaudia and Rutmann, Richard and L. Tokenizer Choice For LLM Training: Negligible or Crucial?. Findings of the Association for Computational Linguistics: NAACL 2024. 2024. doi:10.18653/v1/2024.findings-naacl.247

  34. [34]

    From bytes to ideas: Language modeling with autoregressive u-nets

    Mathurin Videau and Badr Youbi Idrissi and Alessandro Leite and Marc Schoenauer and Olivier Teytaud and David Lopez-Paz. From bytes to ideas: Language modeling with autoregressive u-nets. arXiv preprint arXiv:2506.14761.

  35. [35]

    Neural Machine Translation of Rare Words with Subword Units

    Sennrich, Rico and Haddow, Barry and Birch, Alexandra. Neural Machine Translation of Rare Words with Subword Units. Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2016. doi:10.18653/v1/P16-1162

  36. [36]

    SuperBPE: Space Travel for Language Models

    Liu, Alisa and Hayase, Jonathan and Hofmann, Valentin and Oh, Sewoong and Smith, Noah A. and Choi, Yejin. SuperBPE: Space Travel for Language Models. 2025. arXiv:2503.13423

  37. [37]

    Tokenization Impacts Multilingual Language Modeling: Assessing Vocabulary Allocation and Overlap Across Languages

    Limisiewicz, Tomasz and Balhar, Jiří and Mareček, David. Tokenization Impacts Multilingual Language Modeling: Assessing Vocabulary Allocation and Overlap Across Languages. Findings of the Association for Computational Linguistics: ACL 2023. 2023. doi:10.18653/v1/2023.findings-acl.350

  38. [38]

    MYTE: Morphology-Driven Byte Encoding for Better and Fairer Multilingual Language Modeling

    Limisiewicz, Tomasz and Blevins, Terra and Gonen, Hila and Ahia, Orevaoghene and Zettlemoyer, Luke. MYTE: Morphology-Driven Byte Encoding for Better and Fairer Multilingual Language Modeling. Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2024. doi:10.18653/v1/2024.acl-long.804

  39. [39]

    MAGNET: Improving the Multilingual Fairness of Language Models with Adaptive Gradient-Based Tokenization

    Ahia, Orevaoghene and Kumar, Sachin and Gonen, Hila and Hofmann, Valentin and Limisiewicz, Tomasz and Tsvetkov, Yulia and Smith, Noah A. MAGNET: Improving the Multilingual Fairness of Language Models with Adaptive Gradient-Based Tokenization. doi:10.52202/079017-1514

  40. [40]

    Flexitokens: Flexible tokenization for evolving language models

    FLEXITOKENS: Flexible Tokenization for Evolving Language Models. arXiv preprint arXiv:2507.12720.

  41. [41]

    Subword Regularization: Improving Neural Network Translation Models with Multiple Subword Candidates

    Kudo, Taku. Subword Regularization: Improving Neural Network Translation Models with Multiple Subword Candidates. Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2018. doi:10.18653/v1/P18-1007

  42. [42]

    The Llama 3 Herd of Models

    Llama Team, AI @ Meta. The Llama 3 Herd of Models. arXiv preprint arXiv:2407.21783. 2024

  43. [43]

    Qwen3 Technical Report

    Qwen3 Technical Report. 2025.

  44. [44]

    EuroLLM: Multilingual Language Models for Europe

    Martins, Pedro Henrique and Fernandes, Patrick and Alves, João. EuroLLM: Multilingual Language Models for Europe. doi:10.1016/j.procs.2025.02.260

  45. [45]

    Attention is All You Need

    Vaswani, Ashish and Shazeer, Noam and Parmar, Niki and Uszkoreit, Jakob and Jones, Llion and Gomez, Aidan N. and Kaiser, Lukasz and Polosukhin, Illia. Attention is All You Need. Advances in Neural Information Processing Systems. 2017

  46. [46]

    Decoupled Weight Decay Regularization

    Loshchilov, Ilya and Hutter, Frank. Decoupled Weight Decay Regularization. International Conference on Learning Representations. 2019

  47. [47]

    ByT5: Towards a Token-Free Future with Pre-trained Byte-to-Byte Models

    Xue, Linting and Barua, Aditya and Constant, Noah and Al-Rfou, Rami and Narang, Sharan and Kale, Mihir and Roberts, Adam and Raffel, Colin. ByT5: Towards a Token-Free Future with Pre-trained Byte-to-Byte Models. Transactions of the Association for Computational Linguistics. 2022. doi:10.1162/tacl_a_00461

  48. [48]

    MambaByte: Token-free Selective State Space Model

    MambaByte: Token-free Selective State Space Model. 2024.

  49. [49]

    An Image is Worth 32 Tokens for Reconstruction and Generation

    An Image is Worth 32 Tokens for Reconstruction and Generation. The Thirty-eighth Annual Conference on Neural Information Processing Systems.

  50. [50]

    Neural Discrete Representation Learning

    van den Oord, Aaron and Vinyals, Oriol and Kavukcuoglu, Koray. Neural Discrete Representation Learning. Proceedings of the 31st International Conference on Neural Information Processing Systems. 2017.

  51. [51]

    Scaling Laws for Fine-Grained Mixture of Experts

    Scaling Laws for Fine-Grained Mixture of Experts. Proceedings of the 41st International Conference on Machine Learning. 2024.

  52. [52]

    Do All Languages Cost the Same? Tokenization in the Era of Commercial Language Models

    Ahia, Orevaoghene and Kumar, Sachin and Gonen, Hila and Kasai, Jungo and Mortensen, David and Smith, Noah and Tsvetkov, Yulia. Do All Languages Cost the Same? Tokenization in the Era of Commercial Language Models. Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing. 2023. doi:10.18653/v1/2023.emnlp-main.614

  53. [53]

    Language Model Tokenizers Introduce Unfairness Between Languages

    Language Model Tokenizers Introduce Unfairness Between Languages. Advances in Neural Information Processing Systems.