Tokenization with Split Trees

Adam Wiemerslage; Chris Tanner; Craig W. Schmidt; Michael Krumdick; Seth Ebner; Varshini Reddy; Yuval Pinter

arxiv: 2605.22705 · v1 · pith:VTOLHSYYnew · submitted 2026-05-21 · 💻 cs.CL

Craig W. Schmidt, Michael Krumdick, Adam Wiemerslage, Seth Ebner, Varshini Reddy, Yuval Pinter, Chris Tanner

Pith reviewed 2026-05-22 05:43 UTC · model grok-4.3

keywords: subword tokenization · split trees · integer programming · language modeling · vocabulary selection · compression · BPE · Rényi efficiency

The pith

ToaST reduces English token counts by more than 11% versus BPE at vocabulary sizes of 40,960 and larger, and 1.5B-parameter models trained with it reach higher CORE scores.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces ToaST, a tokenization method that first builds a full binary split tree for each pretoken from precomputed byte n-gram counts. It then chooses a vocabulary by solving an integer program that minimizes the number of tokens emitted when inference walks each tree and stops at the first in-vocabulary node. The linear-programming relaxation of this program stays near-integral, so good solutions are found quickly. Experiments on English data show the resulting tokenizers use over 11% fewer tokens than BPE, WordPiece, or UnigramLM at large vocabulary sizes and also improve Rényi efficiency by using fewer single-byte tokens. When 1.5B-parameter language models are trained with these tokenizers, ToaST records the highest CORE score among the tested methods.

Core claim

ToaST greedily splits each pretoken into a full binary tree using precomputed byte n-gram counts independent of any vocabulary. Given a vocabulary, inference recursively descends each split tree and emits the first in-vocabulary node reached on each path. Vocabulary selection is formulated as an integer program that minimizes the total token count over all split trees under this inference procedure, and the LP relaxation is near-integral in practice, yielding provably near-optimal vocabularies.
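The tree-building step can be illustrated with a minimal Python sketch. The scoring rule used here (pick the cut whose two halves have the largest combined n-gram count, ties to the leftmost cut) is an assumption made for illustration, since this page does not state the paper's actual split criterion. The names `Node`, `build_split_tree`, `leaves`, and the toy counts are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Node:
    text: bytes
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def build_split_tree(span: bytes, counts: Dict[bytes, int]) -> Node:
    """Recursively split `span` until every leaf is a single byte."""
    node = Node(span)
    if len(span) == 1:
        return node
    # ASSUMED scoring rule (not taken from the paper): prefer the cut whose
    # two halves are jointly most frequent; ties go to the leftmost cut.
    best_cut = max(
        range(1, len(span)),
        key=lambda i: (counts.get(span[:i], 0) + counts.get(span[i:], 0), -i),
    )
    node.left = build_split_tree(span[:best_cut], counts)
    node.right = build_split_tree(span[best_cut:], counts)
    return node


def leaves(node: Node) -> List[bytes]:
    """Return the single-byte leaves of the tree, left to right."""
    if node.left is None:
        return [node.text]
    return leaves(node.left) + leaves(node.right)


# Toy n-gram counts, invented for this example.
toy_counts = {
    b" Ken": 50, b"tucky": 40, b" K": 30, b"en": 60,
    b"tu": 20, b"cky": 25, b"ck": 35,
}
tree = build_split_tree(b" Kentucky", toy_counts)
print(tree.left.text, tree.right.text)
```

Because the counts are computed once and no vocabulary appears anywhere in the function, the same tree can be reused for every candidate vocabulary, which is the property the core claim relies on.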

What carries the argument

The split tree, a full binary tree built greedily on byte n-grams for each pretoken, which supports recursive first-in-vocabulary emission and serves as the objective for the integer program that selects the vocabulary.
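Recursive first-in-vocabulary emission is short enough to sketch directly. The tree below is hand-written for illustration and is not the tree from the paper's Figure 1; the tuple layout `(text, left, right)`, the helper `leaf`, and the function name `tokenize` are assumptions, as is the convention that every single byte is always in the vocabulary.

```python
def leaf(b):
    """A single-byte leaf node: (text, left, right) with no children."""
    return (b, None, None)


def tokenize(node, vocab):
    """Descend the split tree and emit the first in-vocabulary node on each path."""
    text, left, right = node
    if text in vocab:
        return [text]  # stop here; descendants are never inspected
    if left is None:
        # Assumption: single bytes are always in the vocabulary.
        raise ValueError(f"leaf {text!r} missing from vocabulary")
    return tokenize(left, vocab) + tokenize(right, vocab)


# Hypothetical split tree for b" Kentucky".
tree = (
    b" Kentucky",
    (b" Ken",
     (b" K", leaf(b" "), leaf(b"K")),
     (b"en", leaf(b"e"), leaf(b"n"))),
    (b"tucky",
     (b"tu", leaf(b"t"), leaf(b"u")),
     (b"cky", (b"ck", leaf(b"c"), leaf(b"k")), leaf(b"y"))),
)

single_bytes = {bytes([b]) for b in b" Kentucky"}
vocab = single_bytes | {b" Ken", b"en", b"ck"}
tokens = tokenize(tree, vocab)
print(tokens)  # [b' Ken', b't', b'u', b'ck', b'y']
```

Note that `b"en"` is in the vocabulary but is never emitted, because its ancestor `b" Ken"` is reached first. This dependence of each node on its ancestors is what the vocabulary-selection objective has to account for.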

If this is right

Token counts drop by more than 11% on English text at vocabulary sizes of 40,960 and above, extending effective context length.
1.5B-parameter language models reach the highest CORE score and outperform baselines by 2.6% to 7.6% with significance in two of three comparisons.
Common single-byte tokens appear less often, producing a substantial gain in Rényi efficiency.
The LP relaxation remains near-integral, so the same optimization procedure scales to practical vocabulary sizes with quadratic training time in the number of split trees.
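For readers unfamiliar with the Rényi efficiency metric mentioned above, here is a minimal sketch using a commonly used definition: the Rényi entropy of order α of the token distribution, divided by the log of the vocabulary size. The function name and the toy counts are hypothetical, and whether the paper normalizes by the full vocabulary size or by the number of observed types is not stated on this page. The default α = 2.5 matches the value in the Figure 8 caption.

```python
import math


def renyi_efficiency(token_counts, vocab_size, alpha=2.5):
    """Renyi entropy of order alpha (alpha != 1), normalized by log(vocab_size)."""
    total = sum(token_counts.values())
    power_sum = sum((c / total) ** alpha for c in token_counts.values() if c > 0)
    entropy = math.log(power_sum) / (1.0 - alpha)
    return entropy / math.log(vocab_size)


# Toy distributions over a 4-token vocabulary (invented numbers).
uniform = renyi_efficiency({"a": 5, "b": 5, "c": 5, "d": 5}, vocab_size=4)
skewed = renyi_efficiency({"a": 97, "b": 1, "c": 1, "d": 1}, vocab_size=4)
print(uniform, skewed)
```

A perfectly uniform distribution scores 1, and a distribution dominated by a few very frequent tokens scores much lower, which is why using common single-byte tokens less often would be expected to raise the metric.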

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same split-tree construction could be tried on non-English text by replacing the byte n-gram statistics with those of the target language.
Because training time grows quadratically with the number of pretokens, sampling a representative subset of the corpus may be needed before applying ToaST to very large datasets.
The reduction in token count may also improve performance on sequence-to-sequence tasks such as machine translation that are sensitive to sequence length.

Load-bearing premise

That the greedy binary splitting of pretokens using precomputed byte n-gram counts, followed by recursive descent to the first in-vocabulary node, produces a compression objective whose integer-program solution yields vocabularies that are near-optimal in actual downstream use.

What would settle it

Measuring token counts and downstream CORE scores for a 1.5B model trained with a ToaST vocabulary of size 40,960 on a fresh English corpus, and checking whether the reported gains versus BPE still appear.

Figures

Figures reproduced from arXiv: 2605.22705 by Adam Wiemerslage, Chris Tanner, Craig W. Schmidt, Michael Krumdick, Seth Ebner, Varshini Reddy, Yuval Pinter.

**Figure 1.** Example split tree for ␣Kentucky.

**Figure 2.** Example tokenization, with white tokens not …

**Figure 3.** A node (blue) appears in the tokenization …

**Figure 4.** Total training time as a function of the number …

**Figure 5.** Validation data bytes per token for ToaST and …

**Figure 8.** Validation Rényi efficiency (α = 2.5) for ToaST and several baselines as a function of vocabulary size. Higher is better, indicating a more uniform token distribution. The four ToaST series are very similar to each other, as are BPE and WordPiece.

**Figure 9.** Example split hierarchy of pretokens ≻ morphemes ≻ characters ≻ bytes for crème brûlée.

**Figure 10.** Cumulative total time to set up and solve the …

**Figure 11.** Individual timing of each resolve step, with …

**Figure 12.** Validation bytes per token of ToaST, zoomed …

**Figure 14.** Token categories for WordPiece, using the …

**Figure 15.** Token categories for UnigramLM, using the …

**Figure 16.** Zipf plot of the validation token frequency …

**Figure 18.** Venn diagram of overlap in the tokens used …

Original abstract

We introduce Tokenization with Split Trees (ToaST), a subword tokenization method that directly optimizes compression under a new recursive inference procedure. ToaST greedily splits each pretoken into a full binary tree using precomputed byte n-gram counts, independent of any vocabulary. Given a vocabulary, inference recursively descends each split tree and emits the first in-vocabulary node reached on each path. Vocabulary selection is formulated as an Integer Program (IP) that minimizes the total token count over all split trees under this inference procedure. The Linear Programming (LP) relaxation is near-integral in practice, yielding provably near-optimal vocabularies, with training time empirically scaling quadratically in the number of split trees. On English text, ToaST reduces token counts by more than 11% compared to BPE, WordPiece, and UnigramLM at vocabulary sizes of 40,960 and above, reducing the number of inference tokens for models using this tokenizer, thus extending the effective context length. ToaST also uses common single-byte tokens less frequently than these baselines, leading to a substantial improvement in Rényi efficiency. In experiments training 1.5B parameter language models, ToaST achieves the highest CORE score, outperforming baselines by 2.6%–7.6%, with significance for two of three, and scoring best on 13 of 22 individual tasks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

ToaST gives a clean new way to optimize vocabularies by building n-gram split trees first and then solving an IP directly on the recursive inference cost, but the LM gains rest on an unclear training protocol that could mix in extra data exposure.

The core idea here is to precompute full binary split trees for every pretoken using byte n-gram counts, then define inference as walking down each tree until you hit the first vocabulary node. Vocabulary selection is turned into an explicit integer program whose objective is exactly the token count produced by that procedure. That separation and the direct objective are the genuinely new pieces compared with BPE-style or Unigram approaches in the citations.
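One way such an integer program could be written is sketched below. This is an illustrative formulation under stated assumptions, not the formulation from the paper, and every symbol is introduced here: $T$ is the set of distinct pretokens with corpus counts $c_t$; $I_t$ is the set of internal (multi-byte) nodes of the split tree of $t$, with root $r_t$; $s(n)$ is the byte string at node $n$ and $\pi(n)$ its parent; $S$ is the set of distinct multi-byte strings; $x_s = 1$ if $s$ is selected; $u_{t,n} = 1$ if the descent passes through node $n$ without emitting; and $K$ is the budget of multi-byte tokens. Single bytes are assumed to be always in the vocabulary, so in a full binary tree the number of emitted tokens equals one plus the number of internal nodes that are passed through.

```latex
\begin{align*}
\min_{x,\,u}\quad
  & \sum_{t \in T} c_t \Bigl( 1 + \sum_{n \in I_t} u_{t,n} \Bigr) \\
\text{s.t.}\quad
  & u_{t,r_t} \ge 1 - x_{s(r_t)}
      && \forall t \in T \text{ with } r_t \in I_t \\
  & u_{t,n} \ge u_{t,\pi(n)} - x_{s(n)}
      && \forall t \in T,\ n \in I_t \setminus \{r_t\} \\
  & \sum_{s \in S} x_s \le K \\
  & x_s \in \{0,1\} \quad \forall s \in S,
    \qquad 0 \le u_{t,n} \le 1
\end{align*}
```

Because the objective is minimized, each $u_{t,n}$ settles at the smallest value the constraints allow, which for binary $x$ is exactly the indicator that neither $n$ nor any of its ancestors is in the vocabulary. Relaxing $x_s \in \{0,1\}$ to $0 \le x_s \le 1$ gives the LP relaxation that the summary above describes as near-integral in practice.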

Referee Report

2 major / 1 minor

Summary. The paper introduces Tokenization with Split Trees (ToaST), a subword tokenization method that greedily builds full binary split trees for pretokens from precomputed byte n-gram counts, then selects a vocabulary by solving an integer program that minimizes total token count under recursive-descent inference to the first in-vocabulary node. It claims the LP relaxation is near-integral and yields provably near-optimal vocabularies, and it reports a token reduction of more than 11% versus BPE, WordPiece, and UnigramLM on English text at vocabulary sizes of 40,960 and above, reduced single-byte token usage, and superior CORE scores (outperforming baselines by 2.6%–7.6%) when training 1.5B-parameter LMs.

Significance. If the near-integrality of the LP solution and the downstream gains hold, the explicit IP formulation of the compression objective under the new inference rule would be a clear strength, offering a more direct optimization path than heuristic methods like BPE. The reported quadratic scaling of training time and the falsifiable token-reduction predictions are also positive features. However, the significance is tempered by the absence of supporting analysis for the LP claim and potential confounds in the LM experiments.

major comments (2)

[Abstract] The statement that the LP relaxation is near-integral and yields provably near-optimal vocabularies is presented without derivation, error analysis, or verification that the reported 11% token reduction is robust to changes in pretokenization or domain.
[LM training experiments] The description of the 1.5B-parameter LM training does not specify whether a fixed token budget or fixed optimizer steps was used; given ToaST's >11% token reduction, this leaves open whether the 2.6%–7.6% CORE gains are attributable to the tokenizer or to processing more raw text.

minor comments (1)

[Introduction] The interaction between the greedy binary splitting procedure and the IP objective could be clarified with a small illustrative example early in the manuscript.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments. We address each major comment below and indicate where revisions will be made to the manuscript.

Referee: [Abstract] The statement that the LP relaxation is near-integral and yields provably near-optimal vocabularies is presented without derivation, error analysis, or verification that the reported 11% token reduction is robust to changes in pretokenization or domain.

Authors: The manuscript formulates vocabulary selection as an integer program minimizing total token count under the recursive inference rule and reports that the LP relaxation is near-integral based on empirical solutions across vocabulary sizes. We agree the abstract states the near-optimality claim without a self-contained derivation or error analysis. We will revise the abstract to qualify the claim as empirically supported and expand the methods or appendix section with additional details on the observed integrality gaps. For robustness, our primary results use standard English pretokenization; we will add a brief discussion of sensitivity to alternative pretokenizers and note that the approach is domain-agnostic, with plans for broader verification.

Revision: yes

Referee: [LM training experiments] The description of the 1.5B-parameter LM training does not specify whether a fixed token budget or fixed optimizer steps was used; given ToaST's >11% token reduction, this leaves open whether the 2.6%–7.6% CORE gains are attributable to the tokenizer or to processing more raw text.

Authors: We thank the referee for identifying this ambiguity. The 1.5B-parameter models were trained for a fixed number of optimizer steps across all tokenizers. This choice means ToaST's lower token count per sequence allows more raw text to be processed within the same step budget. We will revise the experimental description to state this explicitly and discuss the contribution of increased data exposure to the reported CORE improvements.

Revision: yes
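The size of the data-exposure effect discussed in this exchange can be estimated with simple arithmetic. The sketch below assumes every tokenizer sees the same number of tokens per optimizer step and that the compression gain is uniform across the training data; the function name `raw_text_ratio` is hypothetical, and the 11% figure is the lower bound quoted in the abstract.

```python
def raw_text_ratio(token_reduction: float) -> float:
    """Bytes covered by a fixed token budget, relative to the baseline tokenizer.

    `token_reduction` is the fractional drop in token count for the same
    text (0.11 means 11% fewer tokens).
    """
    return 1.0 / (1.0 - token_reduction)


ratio = raw_text_ratio(0.11)
print(f"{ratio:.3f}x raw text per step")  # about 1.124x
```

Under these assumptions, an 11% token reduction lets the model read roughly 12% more raw text in the same number of steps, which is the confound the referee asks the authors to separate from the tokenizer's own contribution.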

Circularity Check

0 steps flagged

No circularity in derivation: IP optimizes explicit objective with external baseline comparisons

The paper defines split trees independently via byte n-gram counts, then formulates vocabulary selection as an IP minimizing total token count under the recursive descent inference rule. Reported token-count reductions (>11% vs BPE/WordPiece/UnigramLM) are direct empirical measurements against those external methods' own vocabularies and inference procedures, not a re-reporting of the IP objective on the same data. Downstream 1.5B LM CORE scores are presented as experimental results without any reduction to a fitted parameter or self-citation chain. No uniqueness theorems, ansatzes, or renamings from prior author work are invoked as load-bearing steps. The derivation chain is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The method depends on the domain assumption that byte n-gram counts computed once on a corpus are sufficient to guide optimal splits, and on the mathematical claim that the LP relaxation of the token-count integer program is near-integral; no free parameters or new entities are introduced in the abstract.

axioms (1)

Domain assumption: Byte n-gram counts computed independently of any vocabulary can be used to construct greedy full binary split trees that support the subsequent recursive inference procedure. Invoked in the description of how each pretoken is split.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · relation: unclear
Paper passage: Vocabulary selection is formulated as an Integer Program (IP) that minimizes the total token count over all split trees under this inference procedure.

IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · bare_distinguishability_of_absolute_floor · relation: unclear
Paper passage: ToaST greedily splits each pretoken into a full binary tree using precomputed byte n-gram counts, independent of any vocabulary.

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

77 extracted references · 77 canonical work pages · 3 internal anchors

[1] Karp, Richard M. In Complexity of Computer Computations. 1972.
[2] Garey, Michael R. and Johnson, David S.
[3] Schrijver, Alexander.
[4] How Much is Enough? The Diminishing Returns of Tokenization Training Data. In Tokenization Workshop.
[5] Gage, Philip. The C Users Journal.
[6] Getting the most out of your tokenizer for pre-training and domain adaptation. In Proceedings of the 41st International Conference on Machine Learning. 2024.
[7] Spielman, Daniel A. and Teng, Shang-Hua. J. ACM. 2004. doi:10.1145/990308.990310
[8] Alisa Liu, Jonathan Hayase, Valentin Hofmann, Sewoong Oh, Noah A. Smith, and Yejin Choi. Super. 2025.
[9] Boundless Byte Pair Encoding: Breaking the Pre-tokenization Barrier. In Second Conference on Language Modeling.
[10] Faster Superword Tokenization. 2026.
[11] Tokenisation over Bounded Alphabets is Hard. 2025.
[12] Deletang, Gregoire and Ruoss, Anian and Duquenne, Paul-Ambroise and Catt, Elliot and Genewein, Tim and Mattern, Christopher and Grau-Moya, Jordi and Wenliang, Li Kevin and Aitchison, Matthew and Orseau, Laurent and Hutter, Marcus and Veness, Joel. Language Modeling Is Compression.
[13] Pyomo–optimization modeling in python. 2021.
[14] Pyomo: modeling and solving mathematical programs in Python. Mathematical Programming Computation. 2011.
[15] Huangfu, Q. and Hall, J. A. J. Mathematical Programming Computation.
[16] Schuster, Mike and Nakajima, Kaisuke. Japanese and Korean voice search.
[17] Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation. 2016.
[18] A Partition Cover Approach to Tokenization. 2025.
[19] BPE Stays on SCRIPT: Structured Encoding for Robust Multilingual Pretokenization. 2025.
[20] Language Model Tokenizers Introduce Unfairness Between Languages. 2023.
[21] Tokenization Disparities as Infrastructure Bias: How Subword Systems Create Inequities in LLM Access and Efficiency. 2025.
[22] The Script Tax: Measuring Tokenization-Driven Efficiency and Latency Disparities in Multilingual Language Models. 2026.
[24] Parity-Aware Byte-Pair Encoding: Improving Cross-lingual Fairness in Tokenization. 2025.
[25] Reducing Tokenization Premiums for Low-Resource Languages. 2026.
[26] George Kingsley Zipf.
[27] DataComp-LM: In search of the next generation of training sets for language models. 2025.
[28] Nemotron-CLIMB: CLustering-based Iterative Data Mixture Bootstrapping for Language Model Pre-training. 2025.
[29] Anthropic. Claude.
[30]

Orevaoghene Ahia, Sachin Kumar, Hila Gonen, Jungo Kasai, David Mortensen, Noah Smith, and Yulia Tsvetkov. 2023. https://doi.org/10.18653/v1/2023.emnlp-main.614 Do all languages cost the same? tokenization in the era of commercial language models . In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 9904--9923, ...

work page doi:10.18653/v1/2023.emnlp-main.614 2023
[31]

Mehdi Ali, Michael Fromm, Klaudia Thellmann, Richard Rutmann, Max L \"u bbering, Johannes Leveling, Katrin Klug, Jan Ebert, Niclas Doll, Jasper Buschhoff, Charvi Jain, Alexander Weber, Lena Jurkschat, Hammam Abdelwahab, Chelsea John, Pedro Ortiz Suarez, Malte Ostendorff, Samuel Weinbach, Rafet Sifa, and 2 others. 2024. https://doi.org/10.18653/v1/2024.fin...

work page doi:10.18653/v1/2024.findings-naacl.247 2024
[32]

Anthropic. 2026. https://www.anthropic.com/claude Claude O pus 4.6

work page 2026
[33]

Chang, and Benjamin Bergen

Catherine Arnett, Tyler A. Chang, and Benjamin Bergen. 2024. https://aclanthology.org/2024.sigul-1.1/ A bit of a problem: Measurement disparities in dataset sizes across languages . In Proceedings of the 3rd Annual Meeting of the Special Interest Group on Under-resourced Languages @ LREC-COLING 2024, pages 1--9, Torino, Italia. ELRA and ICCL

work page 2024
[34]

Duygu Ataman and Marcello Federico. 2018. https://doi.org/10.18653/v1/P18-2049 Compositional representation of morphologically-rich input for neural machine translation . In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 305--311, Melbourne, Australia. Association for Computational L...

work page doi:10.18653/v1/p18-2049 2018
[35]

Kaj Bostrom and Greg Durrett. 2020. https://doi.org/10.18653/v1/2020.findings-emnlp.414 Byte pair encoding is suboptimal for language model pretraining . In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 4617--4624, Online. Association for Computational Linguistics

work page doi:10.18653/v1/2020.findings-emnlp.414 2020
[36]

Bynum, Gabriel A

Michael L. Bynum, Gabriel A. Hackebeil, William E. Hart, Carl D. Laird, Bethany L. Nicholson, John D. Siirola, Jean-Paul Watson, and David L. Woodruff. 2021. Pyomo--optimization modeling in python, third edition, volume 67. Springer Science & Business Media

work page 2021
[37]

Geoffrey Churchill and Steven Skiena. 2026. https://arxiv.org/abs/2601.13328 Reducing tokenization premiums for low-resource languages . Preprint, arXiv:2601.13328

work page arXiv 2026
[38]

Marco Cognetta, Vil \'e m Zouhar, Sangwhan Moon, and Naoaki Okazaki. 2024. https://aclanthology.org/2024.lrec-main.1469/ Two counterexamples to tokenization and the noiseless channel . In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), pages 16897--16906, Torino, It...

work page 2024
[39]

Gautier Dagan, Gabriel Synnaeve, and Baptiste Roziere. 2024. https://proceedings.mlr.press/v235/dagan24a.html Getting the most out of your tokenizer for pre-training and domain adaptation . In Proceedings of the 41st International Conference on Machine Learning, volume 235 of Proceedings of Machine Learning Research, pages 9784--9805. PMLR

work page 2024
[40]

Shizhe Diao, Yu Yang, Yonggan Fu, Xin Dong, Dan Su, Markus Kliegl, Zijia Chen, Peter Belcak, Yoshi Suhara, Hongxu Yin, Mostofa Patwary, Yingyan, Lin, Jan Kautz, and Pavlo Molchanov. 2025. https://arxiv.org/abs/2504.13161 Nemotron-climb: Clustering-based iterative data mixture bootstrapping for language model pre-training . Preprint, arXiv:2504.13161

work page arXiv 2025
[41]

Aradhya Dixit and Shreem Dixit. 2026. https://arxiv.org/abs/2602.11174 The script tax: Measuring tokenization-driven efficiency and latency disparities in multilingual language models . Preprint, arXiv:2602.11174

work page arXiv 2026
[42]

Negar Foroutan, Clara Meister, Debjit Paul, Joel Niklaus, Sina Ahmadi, Antoine Bosselut, and Rico Sennrich. 2025. https://arxiv.org/abs/2508.04796 Parity-aware byte-pair encoding: Improving cross-lingual fairness in tokenization . Preprint, arXiv:2508.04796

work page arXiv 2025
[43]

Philip Gage. 1994. A new algorithm for data compression. The C Users Journal, 12(2):23--38

work page 1994
[44]

Matthias Gall \'e . 2019. https://doi.org/10.18653/v1/D19-1141 Investigating the effectiveness of BPE : The power of shorter sequences . In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 1375--1381, Hong Kong, China. Asso...

work page doi:10.18653/v1/d19-1141 2019
[45]

Garey and David S

Michael R. Garey and David S. Johnson. 1979. Computers and Intractability: A Guide to the Theory of NP -Completeness . W. H. Freeman

work page 1979
[46]

Omer Goldman, Avi Caciularu, Matan Eyal, Kris Cao, Idan Szpektor, and Reut Tsarfaty. 2024. https://doi.org/10.18653/v1/2024.findings-acl.134 Unpacking tokenization: Evaluating text compression and its correlation with model performance . In Findings of the Association for Computational Linguistics: ACL 2024, pages 2274--2286, Bangkok, Thailand. Associatio...

work page doi:10.18653/v1/2024.findings-acl.134 2024
[47]

William E Hart, Jean-Paul Watson, and David L Woodruff. 2011. Pyomo: modeling and solving mathematical programs in python. Mathematical Programming Computation, 3(3):219--260

work page 2011
[48]

Valentin Hofmann, Janet Pierrehumbert, and Hinrich Sch \"u tze. 2021. https://doi.org/10.18653/v1/2021.acl-long.279 Superbizarre is not superb: Derivational morphology improves BERT ' s interpretation of complex words . In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on...

work page doi:10.18653/v1/2021.acl-long.279 2021
[49]

Valentin Hofmann, Hinrich Schuetze, and Janet Pierrehumbert. 2022. https://doi.org/10.18653/v1/2022.acl-short.43 An embarrassingly simple method to mitigate undesirable properties of pretrained language model tokenizers . In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 385--393, Du...

work page doi:10.18653/v1/2022.acl-short.43 2022
[50]

Huangfu and J

Q. Huangfu and J. A. J. Hall. 2018. https://doi.org/10.1007/s12532-017-0130-5 Parallelizing the dual revised simplex method . Mathematical Programming Computation, 10(1):119--142

work page doi:10.1007/s12532-017-0130-5 2018
[51]

Eugene Jang, Kimin Lee, Jin-Woo Chung, Keuntae Park, and Seungwon Shin. 2025. https://doi.org/10.18653/v1/2025.emnlp-main.919 Improbable bigrams expose vulnerabilities of incomplete tokens in byte-level tokenizers . In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 18209--18216, Suzhou, China. Association for...

work page doi:10.18653/v1/2025.emnlp-main.919 2025
[52]

Richard M. Karp. 1972. Reducibility among combinatorial problems. In Raymond E. Miller and James W. Thatcher, editors, Complexity of Computer Computations, pages 85--103. Plenum Press, New York

work page 1972
[53]

Violeta Kastreva, Philip Whittington, Dennis Komm, and Tiago Pimentel. 2025. https://arxiv.org/abs/2511.15709 Tokenisation over bounded alphabets is hard . Preprint, arXiv:2511.15709

work page arXiv 2025
[54]

Taku Kudo. 2018. https://doi.org/10.18653/v1/P18-1007 Subword regularization: Improving neural network translation models with multiple subword candidates . In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 66--75, Melbourne, Australia. Association for Computational Linguistics

work page doi:10.18653/v1/p18-1007 2018
[55]

Sander Land and Catherine Arnett. 2025. https://arxiv.org/abs/2505.24689 Bpe stays on script: Structured encoding for robust multilingual pretokenization . Preprint, arXiv:2505.24689

work page arXiv 2025
[56]

Sander Land and Max Bartolo. 2024. https://doi.org/10.18653/v1/2024.emnlp-main.649 Fishing for magikarp: Automatically detecting under-trained tokens in large language models . In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 11631--11646, Miami, Florida, USA. Association for Computational Linguistics

work page doi:10.18653/v1/2024.emnlp-main.649 2024
[57]

Jeffrey Li, Alex Fang, Georgios Smyrnis, Maor Ivgi, Matt Jordan, Samir Gadre, Hritik Bansal, Etash Guha, Sedrick Keh, Kushal Arora, Saurabh Garg, Rui Xin, Niklas Muennighoff, Reinhard Heckel, Jean Mercat, Mayee Chen, Suchin Gururangan, Mitchell Wortsman, Alon Albalak, and 40 others. 2025. https://arxiv.org/abs/2406.11794 Datacomp-lm: In search of the next...

work page internal anchor Pith review Pith/arXiv arXiv 2025
[58]

Jia Peng Lim, Shawn Tan, Davin Choo, and Hady W. Lauw. 2025. A partition cover approach to tokenization. In The Thirty-ninth Annual Conference on Neural Information Processing Systems

work page 2025
[59]

Tomasz Limisiewicz, Terra Blevins, Hila Gonen, Orevaoghene Ahia, and Luke Zettlemoyer. 2024. https://doi.org/10.18653/v1/2024.acl-long.804 MYTE : Morphology-driven byte encoding for better and fairer multilingual language modeling . In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 15...

work page doi:10.18653/v1/2024.acl-long.804 2024
[60]

Smith, and Yejin Choi

Alisa Liu, Jonathan Hayase, Valentin Hofmann, Sewoong Oh, Noah A. Smith, and Yejin Choi. 2025. https://openreview.net/forum?id=lcDRvffeNP Super BPE : Space travel for language models . In Second Conference on Language Modeling

work page 2025
[61]

Lundin, Ada Zhang, Nihal Karim, Hamza Louzan, Guohao Wei, David Ifeoluwa Adelani, and Cody Carroll

Jessica M. Lundin, Ada Zhang, Nihal Karim, Hamza Louzan, Guohao Wei, David Ifeoluwa Adelani, and Cody Carroll. 2026. https://doi.org/10.18653/v1/2026.africanlp-main.10 The token tax: Systematic bias in multilingual tokenization . In Proceedings of the 7th Workshop on African Natural Language Processing (AfricaNLP 2026), page 103–112. Association for Compu...

work page doi:10.18653/v1/2026.africanlp-main.10 2026
[62]

Rossi, and Thien Huu Nguyen

Thuat Nguyen, Chien Van Nguyen, Viet Dac Lai, Hieu Man, Nghia Trung Ngo, Franck Dernoncourt, Ryan A. Rossi, and Thien Huu Nguyen. 2024. https://aclanthology.org/2024.lrec-main.377/ C ultura X : A cleaned, enormous, and multilingual dataset for large language models in 167 languages . In Proceedings of the 2024 Joint International Conference on Computation...

work page 2024
[63]

Zhang, Coleman Haley, Kenneth Steimel, Han Liu, and Lane Schwartz

Hyunji Hayley Park, Katherine J. Zhang, Coleman Haley, Kenneth Steimel, Han Liu, and Lane Schwartz. 2021. https://doi.org/10.1162/tacl_a_00365 Morphology matters: A multilingual language modeling analysis . Transactions of the Association for Computational Linguistics, 9:261--276

[64]

Aleksandar Petrov, Emanuele La Malfa, Philip H. S. Torr, and Adel Bibi. 2023. https://arxiv.org/abs/2305.15425 Language model tokenizers introduce unfairness between languages. Preprint, arXiv:2305.15425

[65]

Varshini Reddy, Craig W Schmidt, Yuval Pinter, and Chris Tanner. 2026. https://openreview.net/forum?id=IETQ36gehE How much is enough? The diminishing returns of tokenization training data. In Tokenization Workshop

[66]

Phillip Rust, Jonas Pfeiffer, Ivan Vulić, Sebastian Ruder, and Iryna Gurevych. 2021. https://doi.org/10.18653/v1/2021.acl-long.243 How good is your tokenizer? On the monolingual performance of multilingual language models. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Confe...

[67]

Craig W Schmidt, Varshini Reddy, Chris Tanner, and Yuval Pinter. 2025. https://openreview.net/forum?id=oPAjXGV8qQ Boundless byte pair encoding: Breaking the pre-tokenization barrier. In Second Conference on Language Modeling

[68]

Craig W Schmidt, Varshini Reddy, Haoran Zhang, Alec Alameddine, Omri Uzan, Yuval Pinter, and Chris Tanner. 2024. https://doi.org/10.18653/v1/2024.emnlp-main.40 Tokenization is more than compression. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 678--702, Miami, Florida, USA. Association for Computational...

[69]

Craig W. Schmidt, Chris Tanner, and Yuval Pinter. 2026. https://arxiv.org/abs/2604.05192 Faster superword tokenization. Preprint, arXiv:2604.05192

[70]

Mike Schuster and Kaisuke Nakajima. 2012. https://doi.org/10.1109/ICASSP.2012.6289079 Japanese and Korean voice search. In 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 5149--5152

[71]

Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. https://doi.org/10.18653/v1/P16-1009 Improving neural machine translation models with monolingual data. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 86--96, Berlin, Germany. Association for Computational Linguistics

[72]

Hailay Kidu Teklehaymanot and Wolfgang Nejdl. 2025. https://arxiv.org/abs/2510.12389 Tokenization disparities as infrastructure bias: How subword systems create inequities in LLM access and efficiency. Preprint, arXiv:2510.12389

[73]

Omri Uzan, Craig W. Schmidt, Chris Tanner, and Yuval Pinter. 2024. https://doi.org/10.18653/v1/2024.acl-short.73 Greed is all you need: An evaluation of tokenizer inference methods. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 813--822, Bangkok, Thailand. Association for Comput...

[74]

Philip Whittington, Gregor Bachmann, and Tiago Pimentel. 2025. https://doi.org/10.18653/v1/2025.acl-long.1365 Tokenisation is NP-complete. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 28133--28153, Vienna, Austria. Association for Computational Linguistics

[75]

Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V. Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, Jeff Klingner, Apurva Shah, Melvin Johnson, Xiaobing Liu, Łukasz Kaiser, Stephan Gouws, Yoshikiyo Kato, Taku Kudo, Hideto Kazawa, and 12 others. 2016. https://arxiv.org/abs/1609.08144 Google's neural machine translatio...

[76]

Shaked Yehezkel and Yuval Pinter. 2023. https://doi.org/10.18653/v1/2023.eacl-main.45 Incorporating context into subword vocabularies. In Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics, pages 623--635, Dubrovnik, Croatia. Association for Computational Linguistics

[77]

George Kingsley Zipf. 1949. Human Behavior and the Principle of Least Effort. Addison-Wesley

[78]

Vilém Zouhar, Clara Meister, Juan Gastaldi, Li Du, Mrinmaya Sachan, and Ryan Cotterell. 2023. https://doi.org/10.18653/v1/2023.acl-long.284 Tokenization and the noiseless channel. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 5184--5207, Toronto, Canada. Association for Compu...

