arxiv: 2605.22705 · v1 · pith:VTOLHSYYnew · submitted 2026-05-21 · 💻 cs.CL

Tokenization with Split Trees

Pith reviewed 2026-05-22 05:43 UTC · model grok-4.3

classification 💻 cs.CL
keywords subword tokenization · split trees · integer programming · language modeling · vocabulary selection · compression · BPE · Renyi efficiency

The pith

ToaST reduces English token counts by more than 11% versus BPE at vocabulary sizes of 40,960 and larger, while raising the CORE score of 1.5B-parameter models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces ToaST, a tokenization method that first builds a full binary split tree for each pretoken from precomputed byte n-gram counts. It then chooses a vocabulary by solving an integer program that minimizes the number of tokens emitted when inference walks each tree and stops at the first in-vocabulary node. The linear-programming relaxation of this program stays near-integral, so good solutions are found quickly. Experiments on English data show the resulting tokenizers use over 11% fewer tokens than BPE, WordPiece, or UnigramLM at large vocabulary sizes and also improve Renyi efficiency by using fewer single-byte tokens. When 1.5B-parameter language models are trained with these tokenizers, ToaST records the highest CORE score among the tested methods.
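
The inference rule described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the `Node` class, the `tokenize` function, and the particular splits shown for ␣Kentucky are hypothetical, and the sketch assumes every single byte is always in the vocabulary.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    """One node of a split tree: a byte string plus an optional binary split."""
    text: bytes
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def tokenize(node: Node, vocab: set) -> list:
    """Descend the tree and emit the first in-vocabulary node on each path."""
    if node.text in vocab:
        return [node.text]
    if node.left is None or node.right is None:
        # Assumption: single bytes are always in the vocabulary, so a
        # well-formed tree never reaches this branch.
        raise ValueError(f"leaf {node.text!r} is not in the vocabulary")
    return tokenize(node.left, vocab) + tokenize(node.right, vocab)


# Hypothetical split tree for b" Kentucky"; the splits are made up.
tree = Node(
    b" Kentucky",
    Node(b" Kent",
         Node(b" K", Node(b" "), Node(b"K")),
         Node(b"ent", Node(b"e"), Node(b"nt", Node(b"n"), Node(b"t")))),
    Node(b"ucky",
         Node(b"uc", Node(b"u"), Node(b"c")),
         Node(b"ky", Node(b"k"), Node(b"y"))),
)

# All single bytes of the pretoken, plus two multi-byte tokens.
vocab = {bytes([b]) for b in b" Kentucky"} | {b" Kent", b"ky"}
tokens = tokenize(tree, vocab)
print(tokens)
```

Because the descent stops at the first in-vocabulary node, adding a node to the vocabulary hides its whole subtree, which is why the vocabulary choice and the token count are so tightly coupled.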

Core claim

ToaST greedily splits each pretoken into a full binary tree using precomputed byte n-gram counts independent of any vocabulary. Given a vocabulary, inference recursively descends each split tree and emits the first in-vocabulary node reached on each path. Vocabulary selection is formulated as an integer program that minimizes the total token count over all split trees under this inference procedure, and the LP relaxation is near-integral in practice, yielding provably near-optimal vocabularies.
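
To make the selection objective concrete, here is a toy version that replaces the integer program with exhaustive search over a handful of candidates. It is only a sketch: the tuple tree encoding, the function names, the two example trees, and their frequencies are all hypothetical, and the paper solves the problem with an IP and its LP relaxation rather than by enumeration.

```python
from itertools import combinations


def leaf(b):
    """A single-byte leaf; single bytes are assumed to be always in vocabulary."""
    return (b, None, None)


def count_tokens(tree, vocab):
    """Tokens emitted for one split tree under first-in-vocabulary descent."""
    text, left, right = tree
    if text in vocab or left is None:
        return 1
    return count_tokens(left, vocab) + count_tokens(right, vocab)


def multi_byte_nodes(tree, out):
    """Collect the byte strings of all internal (multi-byte) nodes."""
    text, left, right = tree
    if left is not None:
        out.add(text)
        multi_byte_nodes(left, out)
        multi_byte_nodes(right, out)
    return out


def best_vocab(trees, freqs, budget):
    """Brute-force the `budget` multi-byte tokens with the lowest total count."""
    candidates = set()
    for t in trees:
        multi_byte_nodes(t, candidates)
    best = None
    for extra in combinations(sorted(candidates), budget):
        vocab = set(extra)
        cost = sum(f * count_tokens(t, vocab) for t, f in zip(trees, freqs))
        if best is None or cost < best[0]:
            best = (cost, vocab)
    return best


# Two hypothetical split trees that share the subtree for b"th".
th = (b"th", leaf(b"t"), leaf(b"h"))
trees = [
    (b"the", th, leaf(b"e")),
    (b"then", th, (b"en", leaf(b"e"), leaf(b"n"))),
]
freqs = [10, 3]  # made-up pretoken frequencies

cost1, vocab1 = best_vocab(trees, freqs, budget=1)
cost2, vocab2 = best_vocab(trees, freqs, budget=2)
print(cost1, vocab1, cost2, vocab2)
```

Enumeration is exponential in the number of candidates, so it only serves to show what the objective measures on a case small enough to check by hand.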

What carries the argument

The split tree: a full binary tree built greedily from byte n-gram counts for each pretoken. It supports recursive first-in-vocabulary emission and defines the token-count objective that the integer program minimizes when selecting the vocabulary.
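
A greedy construction of this kind might look like the sketch below. The scoring rule is an assumption made for illustration (pick the cut whose two halves have the largest summed n-gram count); this page does not state the criterion the paper actually uses, and `build_split_tree` and the counts are hypothetical.

```python
from collections import Counter


def build_split_tree(text: bytes, counts: Counter):
    """Return (text, left, right); single bytes become leaves (text, None, None)."""
    if len(text) == 1:
        return (text, None, None)
    # Assumed scoring rule, not taken from the paper: choose the cut whose
    # two halves are jointly most frequent according to the n-gram counts.
    cut = max(range(1, len(text)),
              key=lambda i: counts[text[:i]] + counts[text[i:]])
    return (text,
            build_split_tree(text[:cut], counts),
            build_split_tree(text[cut:], counts))


def count_nodes(tree):
    """Total number of nodes in a split tree."""
    _, left, right = tree
    if left is None:
        return 1
    return 1 + count_nodes(left) + count_nodes(right)


# Made-up byte n-gram counts.
counts = Counter({b"t": 80, b"h": 40, b"e": 100, b"th": 50, b"he": 30})
tree = build_split_tree(b"the", counts)
print(tree)
```

Whatever the scoring rule, splitting down to single bytes gives a full binary tree with one leaf per byte, so a pretoken of length n has 2n - 1 nodes, and the tree is fixed before any vocabulary is chosen.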

If this is right

  • Token counts drop by more than 11% on English text at vocabulary sizes of 40,960 and above, extending effective context length.
  • 1.5B-parameter language models reach the highest CORE score and outperform baselines by 2.6% to 7.6% with significance in two of three comparisons.
  • Common single-byte tokens appear less often, producing a substantial gain in Renyi efficiency.
  • The LP relaxation remains near-integral in practice, so the same optimization procedure scales to practical vocabulary sizes, with training time that empirically grows quadratically in the number of split trees.
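
For readers unfamiliar with the Renyi efficiency mentioned in the list above, the sketch below computes it under the commonly used definition: Renyi entropy of order α of the token distribution, divided by the log of the vocabulary size. That the paper normalizes in exactly this way is an assumption; α = 2.5 matches the value quoted in the Figure 8 caption, and the function name and example counts are hypothetical.

```python
import math
from collections import Counter


def renyi_efficiency(token_counts, vocab_size, alpha=2.5):
    """Renyi entropy of order alpha, normalized by log(vocab_size)."""
    if alpha == 1.0:
        raise ValueError("alpha = 1 is the Shannon limit; handle it separately")
    total = sum(token_counts.values())
    power_sum = sum((c / total) ** alpha for c in token_counts.values())
    entropy = math.log(power_sum) / (1.0 - alpha)
    return entropy / math.log(vocab_size)


# Made-up token frequency tables over a 4-token vocabulary.
uniform = Counter({"a": 25, "b": 25, "c": 25, "d": 25})
skewed = Counter({"a": 85, "b": 5, "c": 5, "d": 5})

eff_uniform = renyi_efficiency(uniform, vocab_size=4)
eff_skewed = renyi_efficiency(skewed, vocab_size=4)
print(eff_uniform, eff_skewed)
```

With α > 1 the measure is dominated by the most frequent tokens, so a tokenizer that leans less on a few very common single-byte tokens scores higher.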

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same split-tree construction could be tried on non-English text by replacing the byte n-gram statistics with those of the target language.
  • Because training time grows quadratically with the number of pretokens, sampling a representative subset of the corpus may be needed before applying ToaST to very large datasets.
  • The reduction in token count may also improve performance on sequence-to-sequence tasks such as machine translation that are sensitive to sequence length.

Load-bearing premise

That the greedy binary splitting of pretokens using precomputed byte n-gram counts, followed by recursive descent to the first in-vocabulary node, produces a compression objective whose integer-program solution yields vocabularies that are near-optimal in actual downstream use.

What would settle it

Measuring token counts and downstream CORE scores for a 1.5B model trained with a ToaST vocabulary of size 40,960 on a fresh English corpus, and checking whether the reported gains over BPE still appear.

Figures

Figures reproduced from arXiv: 2605.22705 by Adam Wiemerslage, Chris Tanner, Craig W. Schmidt, Michael Krumdick, Seth Ebner, Varshini Reddy, Yuval Pinter.

Figure 1: Example split tree for ␣Kentucky.
    … context window. The relationship between compression and downstream task performance is less clear. Some studies (Gallé, 2019; Rust et al., 2021; Goldman et al., 2024) find a correlation, while others (Schmidt et al., 2024; Ali et al., 2024) argue compression alone does not explain tokenization quality. However, these practical benefits are reason enough to optimize com…
Figure 2: Example tokenization, with white tokens not …
Figure 3: A node (blue) appears in the tokenization …
Figure 4: Total training time as a function of the number …
Figure 5: Validation data bytes per token for ToaST and …
Figure 8: Validation Rényi efficiency (α = 2.5) for ToaST and several baselines as a function of vocabulary size. Higher is better, indicating a more uniform token distribution. The four ToaST series are very similar to each other, as are BPE and WordPiece.
    … tokens than all baselines with a ~14–19x reduction at a vocabulary size of 65,536. ToaST also produces substantially more Root tokens than the baselines. These …
Figure 9: Example split hierarchy of pretokens ≻ morphemes ≻ characters ≻ bytes for crème brûlée.
    … use the multi-byte character level, as it requires no additional external data, although its effect on English is minimal since English text is almost entirely single-byte characters. Pretoken n-grams for superword construction and gold morpheme splits could be added to support the other levels. The vocabulary construc…
Figure 10: Cumulative total time to set up and solve the …
Figure 11: Individual timing of each resolve step, with …
Figure 14: Token categories for WordPiece, using the …
Figure 12: Validation bytes per token of ToaST, zoomed …
Figure 15: Token categories for UnigramLM, using the …
Figure 16: Zipf plot of the validation token frequency …
Figure 18: Venn diagram of overlap in the tokens used …
Original abstract

We introduce Tokenization with Split Trees (ToaST), a subword tokenization method that directly optimizes compression under a new recursive inference procedure. ToaST greedily splits each pretoken into a full binary tree using precomputed byte n-gram counts, independent of any vocabulary. Given a vocabulary, inference recursively descends each split tree and emits the first in-vocabulary node reached on each path. Vocabulary selection is formulated as an Integer Program (IP) that minimizes the total token count over all split trees under this inference procedure. The Linear Programming (LP) relaxation is near-integral in practice, yielding provably near-optimal vocabularies, with training time empirically scaling quadratically in the number of split trees. On English text, ToaST reduces token counts by more than 11% compared to BPE, WordPiece, and UnigramLM at vocabulary sizes of 40,960 and above, reducing the number of inference tokens for models using this tokenizer, thus extending the effective context length. ToaST also uses common single-byte tokens less frequently than these baselines, leading to a substantial improvement in Renyi efficiency. In experiments training 1.5B parameter language models, ToaST achieves the highest CORE score, outperforming baselines by 2.6%–7.6%, with significance for two of three, and scoring best on 13 of 22 individual tasks.
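
One way to write down an integer program of the kind the abstract describes is sketched below. This is a reconstruction from the inference rule, not the paper's own formulation, and all notation is assumed: $T$ is the set of split trees, $c_t$ the frequency of tree $t$, $I_t$ its internal (multi-byte) nodes, $s(n)$ the byte string at node $n$, $\pi(n)$ the parent of $n$, $x_v$ indicates that string $v$ is in the vocabulary, $u_n$ indicates that inference splits node $n$, and $K$ is the budget of multi-byte tokens. Single bytes are assumed to be always in the vocabulary.

```latex
\begin{align*}
\min_{x,\,u}\quad
  & \sum_{t \in T} c_t \Bigl( 1 + \sum_{n \in I_t} u_n \Bigr) \\
\text{s.t.}\quad
  & u_n \ge 1 - x_{s(n)}
      && \text{for the root } n \text{ of each tree},\\
  & u_n \ge u_{\pi(n)} - x_{s(n)}
      && \text{for each non-root } n \in I_t,\ t \in T,\\
  & \sum_{v} x_v \le K,\\
  & x_v \in \{0,1\}, \quad u_n \in \{0,1\}.
\end{align*}
```

The objective uses the fact that a full binary tree cut off at some frontier has one more leaf than it has split nodes, so the token count for a tree is one plus the number of nodes that inference splits. A node is split exactly when it and all of its ancestors are out of vocabulary, and minimization drives each $u_n$ down to that value. Relaxing the binary constraints to $0 \le x_v, u_n \le 1$ gives the corresponding LP relaxation.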

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces Tokenization with Split Trees (ToaST), a subword tokenization method that greedily builds full binary split trees for pretokens from precomputed byte n-gram counts, then selects a vocabulary by solving an integer program that minimizes the total token count under recursive-descent inference to the first in-vocabulary node. It claims that the LP relaxation is near-integral and yields provably near-optimal vocabularies, and it reports a token reduction of more than 11% versus BPE, WordPiece, and UnigramLM on English text at vocabulary sizes of 40,960 and above, reduced single-byte token usage, and the highest CORE score (outperforming baselines by 2.6%–7.6%) when training 1.5B-parameter LMs.

Significance. If the near-integrality of the LP solution and the downstream gains hold, the explicit IP formulation of the compression objective under the new inference rule would be a clear strength, offering a more direct optimization path than heuristic methods like BPE. The reported quadratic scaling of training time and the falsifiable token-reduction predictions are also positive features. However, the significance is tempered by the absence of supporting analysis for the LP claim and potential confounds in the LM experiments.

major comments (2)
  1. [Abstract] The statement that the LP relaxation is near-integral and yields provably near-optimal vocabularies is presented without derivation, error analysis, or verification that the reported 11% token reduction is robust to changes in pretokenization or domain.
  2. [LM training experiments] The description of the 1.5B-parameter LM training does not specify whether a fixed token budget or fixed optimizer steps was used; given ToaST's >11% token reduction, this leaves open whether the 2.6%–7.6% CORE gains are attributable to the tokenizer or to processing more raw text.
minor comments (1)
  1. [Introduction] The interaction between the greedy binary splitting procedure and the IP objective could be clarified with a small illustrative example early in the manuscript.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments. We address each major comment below and indicate where revisions will be made to the manuscript.

  1. Referee: [Abstract] The statement that the LP relaxation is near-integral and yields provably near-optimal vocabularies is presented without derivation, error analysis, or verification that the reported 11% token reduction is robust to changes in pretokenization or domain.

    Authors: The manuscript formulates vocabulary selection as an integer program minimizing total token count under the recursive inference rule and reports that the LP relaxation is near-integral based on empirical solutions across vocabulary sizes. We agree the abstract states the near-optimality claim without a self-contained derivation or error analysis. We will revise the abstract to qualify the claim as empirically supported and expand the methods or appendix section with additional details on the observed integrality gaps. For robustness, our primary results use standard English pretokenization; we will add a brief discussion of sensitivity to alternative pretokenizers and note that the approach is domain-agnostic, with plans for broader verification. revision: yes

  2. Referee: [LM training experiments] The description of the 1.5B-parameter LM training does not specify whether a fixed token budget or fixed optimizer steps was used; given ToaST's >11% token reduction, this leaves open whether the 2.6%–7.6% CORE gains are attributable to the tokenizer or to processing more raw text.

    Authors: We thank the referee for identifying this ambiguity. The 1.5B-parameter models were trained for a fixed number of optimizer steps across all tokenizers. This choice means ToaST's lower token count per sequence allows more raw text to be processed within the same step budget. We will revise the experimental description to state this explicitly and discuss the contribution of increased data exposure to the reported CORE improvements. revision: yes
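
The size of this confound can be estimated with simple arithmetic. The sketch below assumes every tokenizer sees the same number of tokens per optimizer step and that the roughly 11% token reduction measured on English text also holds on the training data; neither assumption is confirmed on this page, and `raw_text_gain` is a hypothetical helper.

```python
def raw_text_gain(token_reduction: float) -> float:
    """Fractional increase in raw bytes seen when the token budget is fixed."""
    if not 0.0 <= token_reduction < 1.0:
        raise ValueError("token_reduction must be in [0, 1)")
    # Each token covers 1 / (1 - r) times as many bytes on average.
    return 1.0 / (1.0 - token_reduction) - 1.0


gain = raw_text_gain(0.11)
print(f"{gain:.1%} more raw text at a fixed token budget")
```

Under these assumptions an 11% token reduction corresponds to roughly 12% more raw text within the same step budget, so the extra data exposure is of the same order as the gain it might explain.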

Circularity Check

0 steps flagged

No circularity in derivation: IP optimizes explicit objective with external baseline comparisons

The paper defines split trees independently via byte n-gram counts, then formulates vocabulary selection as an IP minimizing total token count under the recursive descent inference rule. Reported token-count reductions (>11% vs BPE/WordPiece/UnigramLM) are direct empirical measurements against those external methods' own vocabularies and inference procedures, not a re-reporting of the IP objective on the same data. Downstream 1.5B LM CORE scores are presented as experimental results without any reduction to a fitted parameter or self-citation chain. No uniqueness theorems, ansatzes, or renamings from prior author work are invoked as load-bearing steps. The derivation chain is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The method depends on the domain assumption that byte n-gram counts computed once on a corpus are sufficient to guide optimal splits, and on the mathematical claim that the LP relaxation of the token-count integer program is near-integral; no free parameters or new entities are introduced in the abstract.

axioms (1)
  • domain assumption: Byte n-gram counts computed independently of any vocabulary can be used to construct greedy full binary split trees that support the subsequent recursive inference procedure.
    Invoked in the description of how each pretoken is split.
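
The precomputation this axiom refers to can be illustrated with a simple counter over pretokens. This is a sketch under assumptions: the function name, the `max_n` cap, and the example frequencies are hypothetical, and the page does not say whether the paper caps n or weights counts by pretoken frequency as done here.

```python
from collections import Counter


def byte_ngram_counts(pretokens, max_n=None):
    """Count every byte n-gram inside each pretoken, weighted by its frequency."""
    counts = Counter()
    for text, freq in pretokens.items():
        limit = len(text) if max_n is None else min(max_n, len(text))
        for n in range(1, limit + 1):
            for i in range(len(text) - n + 1):
                counts[text[i:i + n]] += freq
    return counts


# Made-up pretoken frequencies.
pretokens = {b"the": 2, b"he": 1}
counts = byte_ngram_counts(pretokens)
print(counts)
```

The counts depend only on the corpus, which is what lets the split trees be built once and reused for every vocabulary size.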

pith-pipeline@v0.9.0 · 5795 in / 1403 out tokens · 36345 ms · 2026-05-22T05:43:51.391591+00:00 · methodology

Reference graph

Works this paper leans on

77 extracted references · 77 canonical work pages · 3 internal anchors

  1. [1]

    , title =

    Karp, Richard M. , title =. Complexity of Computer Computations , editor =. 1972 , pages =

  2. [2]

    and Johnson, David S

    Garey, Michael R. and Johnson, David S. , title =

  3. [3]

    Schrijver, Alexander , title =

  4. [4]

    Tokenization Workshop , year=

    How Much is Enough? The Diminishing Returns of Tokenization Training Data , author=. Tokenization Workshop , year=

  5. [5]

    The C Users Journal , year =

    Gage, Philip , title =. The C Users Journal , year =

  6. [6]

    Proceedings of the 41st International Conference on Machine Learning , pages =

    Getting the most out of your tokenizer for pre-training and domain adaptation , author =. Proceedings of the 41st International Conference on Machine Learning , pages =. 2024 , editor =

  7. [7]

    and Teng, Shang-Hua , title =

    Spielman, Daniel A. and Teng, Shang-Hua , title =. J. ACM , month = may, pages =. 2004 , issue_date =. doi:10.1145/990308.990310 , abstract =

  8. [8]

    Smith and Yejin Choi , booktitle=

    Alisa Liu and Jonathan Hayase and Valentin Hofmann and Sewoong Oh and Noah A. Smith and Yejin Choi , booktitle=. Super. 2025 , url=

  9. [9]

    Second Conference on Language Modeling , year=

    Boundless Byte Pair Encoding: Breaking the Pre-tokenization Barrier , author=. Second Conference on Language Modeling , year=

  10. [10]

    2026 , eprint=

    Faster Superword Tokenization , author=. 2026 , eprint=

  11. [11]

    2025 , eprint=

    Tokenisation over Bounded Alphabets is Hard , author=. 2025 , eprint=

  12. [12]

    Language Modeling Is Compression , url =

    Deletang, Gregoire and Ruoss, Anian and Duquenne, Paul-Ambroise and Catt, Elliot and Genewein, Tim and Mattern, Christopher and Grau-Moya, Jordi and Wenliang, Li Kevin and Aitchison, Matthew and Orseau, Laurent and Hutter, Marcus and Veness, Joel , booktitle =. Language Modeling Is Compression , url =

  13. [13]

    2021 , publisher=

    Pyomo--optimization modeling in python , author=. 2021 , publisher=

  14. [14]

    Mathematical Programming Computation , volume=

    Pyomo: modeling and solving mathematical programs in Python , author=. Mathematical Programming Computation , volume=. 2011 , publisher=

  15. [15]

    and Hall, J

    Huangfu, Q. and Hall, J. A. J. , title =. Mathematical Programming Computation , year =

  16. [16]

    Japanese and Korean voice search , year=

    Schuster, Mike and Nakajima, Kaisuke , booktitle=. Japanese and Korean voice search , year=

  17. [17]

    2016 , eprint=

    Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation , author=. 2016 , eprint=

  18. [18]

    2025 , booktitle=

    A Partition Cover Approach to Tokenization , author=. 2025 , booktitle=

  19. [19]

    2025 , eprint=

    BPE Stays on SCRIPT: Structured Encoding for Robust Multilingual Pretokenization , author=. 2025 , eprint=

  20. [20]

    2023 , eprint=

    Language Model Tokenizers Introduce Unfairness Between Languages , author=. 2023 , eprint=

  21. [21]

    2025 , eprint=

    Tokenization Disparities as Infrastructure Bias: How Subword Systems Create Inequities in LLM Access and Efficiency , author=. 2025 , eprint=

  22. [22]

    2026 , eprint=

    The Script Tax: Measuring Tokenization-Driven Efficiency and Latency Disparities in Multilingual Language Models , author=. 2026 , eprint=

  23. [24]

    2025 , eprint=

    Parity-Aware Byte-Pair Encoding: Improving Cross-lingual Fairness in Tokenization , author=. 2025 , eprint=

  24. [25]

    2026 , eprint=

    Reducing Tokenization Premiums for Low-Resource Languages , author=. 2026 , eprint=

  25. [26]

    George Kingsley Zipf , title =

  26. [27]

    2025 , eprint=

    DataComp-LM: In search of the next generation of training sets for language models , author=. 2025 , eprint=

  27. [28]

    2025 , eprint=

    Nemotron-CLIMB: CLustering-based Iterative Data Mixture Bootstrapping for Language Model Pre-training , author=. 2025 , eprint=

  28. [29]

    Anthropic , year=. Claude

  29. [30]

    Orevaoghene Ahia, Sachin Kumar, Hila Gonen, Jungo Kasai, David Mortensen, Noah Smith, and Yulia Tsvetkov. 2023. https://doi.org/10.18653/v1/2023.emnlp-main.614 Do all languages cost the same? tokenization in the era of commercial language models . In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 9904--9923, ...

  30. [31]

    Mehdi Ali, Michael Fromm, Klaudia Thellmann, Richard Rutmann, Max Lübbering, Johannes Leveling, Katrin Klug, Jan Ebert, Niclas Doll, Jasper Buschhoff, Charvi Jain, Alexander Weber, Lena Jurkschat, Hammam Abdelwahab, Chelsea John, Pedro Ortiz Suarez, Malte Ostendorff, Samuel Weinbach, Rafet Sifa, and 2 others. 2024. https://doi.org/10.18653/v1/2024.fin...

  31. [32]

    Anthropic. 2026. https://www.anthropic.com/claude Claude Opus 4.6

  32. [33]

    Chang, and Benjamin Bergen

    Catherine Arnett, Tyler A. Chang, and Benjamin Bergen. 2024. https://aclanthology.org/2024.sigul-1.1/ A bit of a problem: Measurement disparities in dataset sizes across languages . In Proceedings of the 3rd Annual Meeting of the Special Interest Group on Under-resourced Languages @ LREC-COLING 2024, pages 1--9, Torino, Italia. ELRA and ICCL

  33. [34]

    Duygu Ataman and Marcello Federico. 2018. https://doi.org/10.18653/v1/P18-2049 Compositional representation of morphologically-rich input for neural machine translation . In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 305--311, Melbourne, Australia. Association for Computational L...

  34. [35]

    Kaj Bostrom and Greg Durrett. 2020. https://doi.org/10.18653/v1/2020.findings-emnlp.414 Byte pair encoding is suboptimal for language model pretraining . In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 4617--4624, Online. Association for Computational Linguistics

  35. [36]

    Bynum, Gabriel A

    Michael L. Bynum, Gabriel A. Hackebeil, William E. Hart, Carl D. Laird, Bethany L. Nicholson, John D. Siirola, Jean-Paul Watson, and David L. Woodruff. 2021. Pyomo--optimization modeling in python, third edition, volume 67. Springer Science & Business Media

  36. [37]

    Geoffrey Churchill and Steven Skiena. 2026. https://arxiv.org/abs/2601.13328 Reducing tokenization premiums for low-resource languages . Preprint, arXiv:2601.13328

  37. [38]

    Marco Cognetta, Vilém Zouhar, Sangwhan Moon, and Naoaki Okazaki. 2024. https://aclanthology.org/2024.lrec-main.1469/ Two counterexamples to tokenization and the noiseless channel. In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), pages 16897--16906, Torino, It...

  38. [39]

    Gautier Dagan, Gabriel Synnaeve, and Baptiste Roziere. 2024. https://proceedings.mlr.press/v235/dagan24a.html Getting the most out of your tokenizer for pre-training and domain adaptation . In Proceedings of the 41st International Conference on Machine Learning, volume 235 of Proceedings of Machine Learning Research, pages 9784--9805. PMLR

  39. [40]

    Shizhe Diao, Yu Yang, Yonggan Fu, Xin Dong, Dan Su, Markus Kliegl, Zijia Chen, Peter Belcak, Yoshi Suhara, Hongxu Yin, Mostofa Patwary, Yingyan, Lin, Jan Kautz, and Pavlo Molchanov. 2025. https://arxiv.org/abs/2504.13161 Nemotron-climb: Clustering-based iterative data mixture bootstrapping for language model pre-training . Preprint, arXiv:2504.13161

  40. [41]

    Aradhya Dixit and Shreem Dixit. 2026. https://arxiv.org/abs/2602.11174 The script tax: Measuring tokenization-driven efficiency and latency disparities in multilingual language models . Preprint, arXiv:2602.11174

  41. [42]

    Negar Foroutan, Clara Meister, Debjit Paul, Joel Niklaus, Sina Ahmadi, Antoine Bosselut, and Rico Sennrich. 2025. https://arxiv.org/abs/2508.04796 Parity-aware byte-pair encoding: Improving cross-lingual fairness in tokenization . Preprint, arXiv:2508.04796

  42. [43]

    Philip Gage. 1994. A new algorithm for data compression. The C Users Journal, 12(2):23--38

  43. [44]

    Matthias Gallé. 2019. https://doi.org/10.18653/v1/D19-1141 Investigating the effectiveness of BPE: The power of shorter sequences. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 1375--1381, Hong Kong, China. Asso...

  44. [45]

    Garey and David S

    Michael R. Garey and David S. Johnson. 1979. Computers and Intractability: A Guide to the Theory of NP-Completeness. W. H. Freeman

  45. [46]

    Omer Goldman, Avi Caciularu, Matan Eyal, Kris Cao, Idan Szpektor, and Reut Tsarfaty. 2024. https://doi.org/10.18653/v1/2024.findings-acl.134 Unpacking tokenization: Evaluating text compression and its correlation with model performance . In Findings of the Association for Computational Linguistics: ACL 2024, pages 2274--2286, Bangkok, Thailand. Associatio...

  46. [47]

    William E Hart, Jean-Paul Watson, and David L Woodruff. 2011. Pyomo: modeling and solving mathematical programs in python. Mathematical Programming Computation, 3(3):219--260

  47. [48]

    Valentin Hofmann, Janet Pierrehumbert, and Hinrich Schütze. 2021. https://doi.org/10.18653/v1/2021.acl-long.279 Superbizarre is not superb: Derivational morphology improves BERT's interpretation of complex words. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on...

  48. [49]

    Valentin Hofmann, Hinrich Schuetze, and Janet Pierrehumbert. 2022. https://doi.org/10.18653/v1/2022.acl-short.43 An embarrassingly simple method to mitigate undesirable properties of pretrained language model tokenizers . In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 385--393, Du...

  49. [50]

    Huangfu and J

    Q. Huangfu and J. A. J. Hall. 2018. https://doi.org/10.1007/s12532-017-0130-5 Parallelizing the dual revised simplex method . Mathematical Programming Computation, 10(1):119--142

  50. [51]

    Eugene Jang, Kimin Lee, Jin-Woo Chung, Keuntae Park, and Seungwon Shin. 2025. https://doi.org/10.18653/v1/2025.emnlp-main.919 Improbable bigrams expose vulnerabilities of incomplete tokens in byte-level tokenizers . In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 18209--18216, Suzhou, China. Association for...

  51. [52]

    Richard M. Karp. 1972. Reducibility among combinatorial problems. In Raymond E. Miller and James W. Thatcher, editors, Complexity of Computer Computations, pages 85--103. Plenum Press, New York

  52. [53]

    Violeta Kastreva, Philip Whittington, Dennis Komm, and Tiago Pimentel. 2025. https://arxiv.org/abs/2511.15709 Tokenisation over bounded alphabets is hard . Preprint, arXiv:2511.15709

  53. [54]

    Taku Kudo. 2018. https://doi.org/10.18653/v1/P18-1007 Subword regularization: Improving neural network translation models with multiple subword candidates . In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 66--75, Melbourne, Australia. Association for Computational Linguistics

  54. [55]

    Sander Land and Catherine Arnett. 2025. https://arxiv.org/abs/2505.24689 Bpe stays on script: Structured encoding for robust multilingual pretokenization . Preprint, arXiv:2505.24689

  55. [56]

    Sander Land and Max Bartolo. 2024. https://doi.org/10.18653/v1/2024.emnlp-main.649 Fishing for magikarp: Automatically detecting under-trained tokens in large language models . In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 11631--11646, Miami, Florida, USA. Association for Computational Linguistics

  56. [57]

    Jeffrey Li, Alex Fang, Georgios Smyrnis, Maor Ivgi, Matt Jordan, Samir Gadre, Hritik Bansal, Etash Guha, Sedrick Keh, Kushal Arora, Saurabh Garg, Rui Xin, Niklas Muennighoff, Reinhard Heckel, Jean Mercat, Mayee Chen, Suchin Gururangan, Mitchell Wortsman, Alon Albalak, and 40 others. 2025. https://arxiv.org/abs/2406.11794 Datacomp-lm: In search of the next...

  57. [58]

    Jia Peng Lim, Shawn Tan, Davin Choo, and Hady W. Lauw. 2025. A partition cover approach to tokenization. In The Thirty-ninth Annual Conference on Neural Information Processing Systems

  58. [59]

    Tomasz Limisiewicz, Terra Blevins, Hila Gonen, Orevaoghene Ahia, and Luke Zettlemoyer. 2024. https://doi.org/10.18653/v1/2024.acl-long.804 MYTE: Morphology-driven byte encoding for better and fairer multilingual language modeling. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 15...

  59. [60]

    Smith, and Yejin Choi

    Alisa Liu, Jonathan Hayase, Valentin Hofmann, Sewoong Oh, Noah A. Smith, and Yejin Choi. 2025. https://openreview.net/forum?id=lcDRvffeNP SuperBPE: Space travel for language models. In Second Conference on Language Modeling

  60. [61]

    Lundin, Ada Zhang, Nihal Karim, Hamza Louzan, Guohao Wei, David Ifeoluwa Adelani, and Cody Carroll

    Jessica M. Lundin, Ada Zhang, Nihal Karim, Hamza Louzan, Guohao Wei, David Ifeoluwa Adelani, and Cody Carroll. 2026. https://doi.org/10.18653/v1/2026.africanlp-main.10 The token tax: Systematic bias in multilingual tokenization . In Proceedings of the 7th Workshop on African Natural Language Processing (AfricaNLP 2026), page 103–112. Association for Compu...

  61. [62]

    Rossi, and Thien Huu Nguyen

    Thuat Nguyen, Chien Van Nguyen, Viet Dac Lai, Hieu Man, Nghia Trung Ngo, Franck Dernoncourt, Ryan A. Rossi, and Thien Huu Nguyen. 2024. https://aclanthology.org/2024.lrec-main.377/ CulturaX: A cleaned, enormous, and multilingual dataset for large language models in 167 languages. In Proceedings of the 2024 Joint International Conference on Computation...

  62. [63]

    Zhang, Coleman Haley, Kenneth Steimel, Han Liu, and Lane Schwartz

    Hyunji Hayley Park, Katherine J. Zhang, Coleman Haley, Kenneth Steimel, Han Liu, and Lane Schwartz. 2021. https://doi.org/10.1162/tacl_a_00365 Morphology matters: A multilingual language modeling analysis . Transactions of the Association for Computational Linguistics, 9:261--276

  63. [64]

    Aleksandar Petrov, Emanuele La Malfa, Philip H. S. Torr, and Adel Bibi. 2023. https://arxiv.org/abs/2305.15425 Language model tokenizers introduce unfairness between languages . Preprint, arXiv:2305.15425

  64. [65]

    Varshini Reddy, Craig W Schmidt, Yuval Pinter, and Chris Tanner. 2026. https://openreview.net/forum?id=IETQ36gehE How much is enough? the diminishing returns of tokenization training data . In Tokenization Workshop

  65. [66]

    Phillip Rust, Jonas Pfeiffer, Ivan Vulić, Sebastian Ruder, and Iryna Gurevych. 2021. https://doi.org/10.18653/v1/2021.acl-long.243 How good is your tokenizer? on the monolingual performance of multilingual language models. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Confe...

  66. [67]

    Craig W Schmidt, Varshini Reddy, Chris Tanner, and Yuval Pinter. 2025. https://openreview.net/forum?id=oPAjXGV8qQ Boundless byte pair encoding: Breaking the pre-tokenization barrier . In Second Conference on Language Modeling

  67. [68]

    Craig W Schmidt, Varshini Reddy, Haoran Zhang, Alec Alameddine, Omri Uzan, Yuval Pinter, and Chris Tanner. 2024. https://doi.org/10.18653/v1/2024.emnlp-main.40 Tokenization is more than compression . In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 678--702, Miami, Florida, USA. Association for Computational...

  68. [69]

    Faster Superword Tokenization

    Craig W. Schmidt, Chris Tanner, and Yuval Pinter. 2026. https://arxiv.org/abs/2604.05192 Faster superword tokenization . Preprint, arXiv:2604.05192

  69. [70]

    Mike Schuster and Kaisuke Nakajima. 2012. https://doi.org/10.1109/ICASSP.2012.6289079 Japanese and korean voice search . In 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 5149--5152

  70. [71]

    Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. https://doi.org/10.18653/v1/P16-1009 Improving neural machine translation models with monolingual data . In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 86--96, Berlin, Germany. Association for Computational Linguistics

  71. [72]

    Hailay Kidu Teklehaymanot and Wolfgang Nejdl. 2025. https://arxiv.org/abs/2510.12389 Tokenization disparities as infrastructure bias: How subword systems create inequities in llm access and efficiency . Preprint, arXiv:2510.12389

  72. [73]

    Schmidt, Chris Tanner, and Yuval Pinter

    Omri Uzan, Craig W. Schmidt, Chris Tanner, and Yuval Pinter. 2024. https://doi.org/10.18653/v1/2024.acl-short.73 Greed is all you need: An evaluation of tokenizer inference methods . In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 813--822, Bangkok, Thailand. Association for Comput...

  73. [74]

    Philip Whittington, Gregor Bachmann, and Tiago Pimentel. 2025. https://doi.org/10.18653/v1/2025.acl-long.1365 Tokenisation is NP-complete. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 28133--28153, Vienna, Austria. Association for Computational Linguistics

  74. [75]

    Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V. Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, Jeff Klingner, Apurva Shah, Melvin Johnson, Xiaobing Liu, Łukasz Kaiser, Stephan Gouws, Yoshikiyo Kato, Taku Kudo, Hideto Kazawa, and 12 others. 2016. https://arxiv.org/abs/1609.08144 Google's neural machine translatio...

  75. [76]

    Shaked Yehezkel and Yuval Pinter. 2023. https://doi.org/10.18653/v1/2023.eacl-main.45 Incorporating context into subword vocabularies . In Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics, pages 623--635, Dubrovnik, Croatia. Association for Computational Linguistics

  76. [77]

    George Kingsley Zipf. 1949. Human Behavior and the Principle of Least Effort. Addison-Wesley

  77. [78]

    Vilém Zouhar, Clara Meister, Juan Gastaldi, Li Du, Mrinmaya Sachan, and Ryan Cotterell. 2023. https://doi.org/10.18653/v1/2023.acl-long.284 Tokenization and the noiseless channel. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 5184--5207, Toronto, Canada. Association for Compu...