Tokenization with Split Trees
Pith reviewed 2026-05-22 05:43 UTC · model grok-4.3
The pith
ToaST reduces English token counts by more than 11% versus BPE at vocabulary sizes of 40,960 and above, and raises the CORE scores of 1.5B-parameter models.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
ToaST greedily splits each pretoken into a full binary tree using precomputed byte n-gram counts independent of any vocabulary. Given a vocabulary, inference recursively descends each split tree and emits the first in-vocabulary node reached on each path. Vocabulary selection is formulated as an integer program that minimizes the total token count over all split trees under this inference procedure, and the LP relaxation is near-integral in practice, yielding provably near-optimal vocabularies.
What carries the argument
The split tree, a full binary tree built greedily on byte n-grams for each pretoken, which supports recursive first-in-vocabulary emission and serves as the objective for the integer program that selects the vocabulary.
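The construction and inference rule described here can be sketched in Python. This is a minimal illustration under stated assumptions, not the paper's implementation: the source says only that splitting is greedy and uses precomputed byte n-gram counts, so the split criterion below (pick the cut that maximizes the summed counts of the two halves), the `Node` class, and the toy counts are all hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Node:
    """One node of a split tree; leaves hold a single byte."""
    text: bytes
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def build_split_tree(pretoken: bytes, score: Callable[[bytes], float]) -> Node:
    """Greedily split a pretoken into a full binary tree with single-byte leaves.

    ASSUMPTION: the cut maximizes score(left) + score(right); the paper's
    actual greedy criterion may differ.
    """
    node = Node(pretoken)
    if len(pretoken) > 1:
        cut = max(
            range(1, len(pretoken)),
            key=lambda i: score(pretoken[:i]) + score(pretoken[i:]),
        )
        node.left = build_split_tree(pretoken[:cut], score)
        node.right = build_split_tree(pretoken[cut:], score)
    return node


def tokenize(node: Node, vocab: set[bytes]) -> list[bytes]:
    """Descend the tree and emit the first in-vocabulary node on each path.

    Single-byte leaves are always emitted, so every input can be tokenized.
    """
    if node.text in vocab or node.left is None:
        return [node.text]
    return tokenize(node.left, vocab) + tokenize(node.right, vocab)


# Hypothetical byte n-gram counts, invented for this example.
TOY_COUNTS = {b"token": 30, b"to": 25, b"s": 20, b"ken": 15}


def score(ngram: bytes) -> float:
    return TOY_COUNTS.get(ngram, 0)


tree = build_split_tree(b"tokens", score)
print(tokenize(tree, {b"token"}))      # [b'token', b's']
print(tokenize(tree, {b"to", b"en"}))  # [b'to', b'k', b'en', b's']
```

Note that the tree is fixed before any vocabulary is chosen: a vocabulary entry such as `b"ok"` can never be emitted for this pretoken, because no node of the tree spells it.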
If this is right
- Token counts drop by more than 11% on English text at vocabulary sizes of 40,960 and above, extending effective context length.
- 1.5B-parameter language models reach the highest CORE score and outperform baselines by 2.6% to 7.6% with significance in two of three comparisons.
- Common single-byte tokens appear less often, producing a substantial gain in Renyi efficiency.
- The LP relaxation remains near-integral in practice, so the same optimization procedure can be applied at practical vocabulary sizes, with training time that empirically scales quadratically in the number of split trees.
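For readers unfamiliar with the Rényi efficiency mentioned in the list above, the following sketch computes it from a token stream. It uses the standard definition of Rényi entropy of order alpha, normalized by the log of the vocabulary size. The function name and the default `alpha=2.5` are assumptions for illustration; the source does not state which order the paper uses.

```python
import math
from collections import Counter
from typing import Hashable, Iterable


def renyi_efficiency(tokens: Iterable[Hashable], vocab_size: int, alpha: float = 2.5) -> float:
    """Rényi entropy of the empirical token distribution divided by log(vocab_size).

    ASSUMPTION: alpha=2.5 is only a default chosen for this sketch.
    """
    counts = Counter(tokens)
    total = sum(counts.values())
    probs = [c / total for c in counts.values()]
    if math.isclose(alpha, 1.0):
        # The limit alpha -> 1 is the Shannon entropy.
        entropy = -sum(p * math.log(p) for p in probs)
    else:
        entropy = math.log(sum(p ** alpha for p in probs)) / (1.0 - alpha)
    return entropy / math.log(vocab_size)


uniform = ["a", "b", "c", "d"] * 25
skewed = ["a"] * 97 + ["b", "c", "d"]
print(renyi_efficiency(uniform, vocab_size=4))
print(renyi_efficiency(skewed, vocab_size=4))
```

A distribution dominated by a few very frequent tokens scores lower than a flatter one, which is consistent with the claim that using common single-byte tokens less often raises the metric.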
Where Pith is reading between the lines
- The same split-tree construction could be tried on non-English text by replacing the byte n-gram statistics with those of the target language.
- Because training time grows quadratically with the number of pretokens, sampling a representative subset of the corpus may be needed before applying ToaST to very large datasets.
- The reduction in token count may also improve performance on sequence-to-sequence tasks such as machine translation that are sensitive to sequence length.
Load-bearing premise
That the greedy binary splitting of pretokens using precomputed byte n-gram counts, followed by recursive descent to the first in-vocabulary node, produces a compression objective whose integer-program solution yields vocabularies that are near-optimal in actual downstream use.
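One way such an integer program could be written is sketched below. This is a reconstruction for illustration only, and the paper's actual formulation may differ. All symbols are assumptions introduced here: $T$ is the set of split trees, $w_t$ the corpus frequency of the pretoken behind tree $t$, $N_t$ its nodes, $P_t$ its root-to-leaf paths, $s(n)$ the byte string at node $n$, $C$ the set of multi-byte strings that appear as nodes, and $K$ the budget of multi-byte vocabulary entries. Single bytes are assumed to be always available.

```latex
\begin{aligned}
\min_{x,\,y}\quad & \sum_{t \in T} w_t \sum_{n \in N_t} y_{t,n} \\
\text{s.t.}\quad & \sum_{n \in p} y_{t,n} = 1 && \forall t \in T,\ \forall p \in P_t \\
& y_{t,n} \le x_{s(n)} && \forall t \in T,\ \forall n \in N_t \text{ with } |s(n)| > 1 \\
& \sum_{v \in C} x_v \le K \\
& x_v \in \{0,1\},\quad y_{t,n} \in \{0,1\}
\end{aligned}
```

Here $x_v = 1$ means string $v$ is in the vocabulary and $y_{t,n} = 1$ means node $n$ is emitted as a token. The path constraint forces exactly one emitted node on every root-to-leaf path. Because each tree is a full binary tree, emitting an in-vocabulary node always uses fewer tokens than emitting any cut below it, so under this sketch the minimizer coincides with the first-in-vocabulary rule. The LP relaxation replaces the binary constraints with $0 \le x_v \le 1$ and $0 \le y_{t,n} \le 1$.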
What would settle it
Measuring token counts and downstream CORE scores for a 1.5B model trained with a ToaST vocabulary chosen at size 40960 on a fresh English corpus and checking whether the reported gains versus BPE still appear.
Original abstract
We introduce Tokenization with Split Trees (ToaST), a subword tokenization method that directly optimizes compression under a new recursive inference procedure. ToaST greedily splits each pretoken into a full binary tree using precomputed byte n-gram counts, independent of any vocabulary. Given a vocabulary, inference recursively descends each split tree and emits the first in-vocabulary node reached on each path. Vocabulary selection is formulated as an Integer Program (IP) that minimizes the total token count over all split trees under this inference procedure. The Linear Programming (LP) relaxation is near-integral in practice, yielding provably near-optimal vocabularies, with training time empirically scaling quadratically in the number of split trees. On English text, ToaST reduces token counts by more than 11% compared to BPE, WordPiece, and UnigramLM at vocabulary sizes of 40,960 and above, reducing the number of inference tokens for models using this tokenizer, thus extending the effective context length. ToaST also uses common single-byte tokens less frequently than these baselines, leading to a substantial improvement in Renyi efficiency. In experiments training 1.5B parameter language models, ToaST achieves the highest CORE score, outperforming baselines by 2.6%--7.6%, with significance for two of three, and scoring best on 13 of 22 individual tasks.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Tokenization with Split Trees (ToaST), a subword tokenization method that greedily builds full binary split trees for pretokens from precomputed byte n-gram counts, then selects a vocabulary by solving an integer program that minimizes total token count under recursive-descent inference to the first in-vocabulary node. It claims that the LP relaxation is near-integral and yields provably near-optimal vocabularies. It reports a token reduction of more than 11% versus BPE, WordPiece, and UnigramLM on English text at vocabulary sizes of 40,960 and above, reduced single-byte token usage, and superior CORE scores (outperforming baselines by 2.6% to 7.6%) when training 1.5B-parameter LMs.
Significance. If the near-integrality of the LP solution and the downstream gains hold, the explicit IP formulation of the compression objective under the new inference rule would be a clear strength, offering a more direct optimization path than heuristic methods like BPE. The reported quadratic scaling of training time and the falsifiable token-reduction predictions are also positive features. However, the significance is tempered by the absence of supporting analysis for the LP claim and potential confounds in the LM experiments.
major comments (2)
- [Abstract] The statement that the LP relaxation is near-integral and yields provably near-optimal vocabularies is presented without derivation, error analysis, or verification that the reported 11% token reduction is robust to changes in pretokenization or domain.
- [LM training experiments] The description of the 1.5B-parameter LM training does not specify whether a fixed token budget or a fixed number of optimizer steps was used; given ToaST's >11% token reduction, this leaves open whether the 2.6%-7.6% CORE gains are attributable to the tokenizer or to processing more raw text.
minor comments (1)
- [Introduction] The interaction between the greedy binary splitting procedure and the IP objective could be clarified with a small illustrative example early in the manuscript.
Simulated Author's Rebuttal
We thank the referee for their constructive comments. We address each major comment below and indicate where revisions will be made to the manuscript.
- Referee: [Abstract] The statement that the LP relaxation is near-integral and yields provably near-optimal vocabularies is presented without derivation, error analysis, or verification that the reported 11% token reduction is robust to changes in pretokenization or domain.
  Authors: The manuscript formulates vocabulary selection as an integer program minimizing total token count under the recursive inference rule and reports that the LP relaxation is near-integral based on empirical solutions across vocabulary sizes. We agree the abstract states the near-optimality claim without a self-contained derivation or error analysis. We will revise the abstract to qualify the claim as empirically supported and expand the methods or appendix section with additional details on the observed integrality gaps. For robustness, our primary results use standard English pretokenization; we will add a brief discussion of sensitivity to alternative pretokenizers and note that the approach is domain-agnostic, with plans for broader verification.
  Revision: yes
- Referee: [LM training experiments] The description of the 1.5B-parameter LM training does not specify whether a fixed token budget or a fixed number of optimizer steps was used; given ToaST's >11% token reduction, this leaves open whether the 2.6%-7.6% CORE gains are attributable to the tokenizer or to processing more raw text.
  Authors: We thank the referee for identifying this ambiguity. The 1.5B-parameter models were trained for a fixed number of optimizer steps across all tokenizers. This choice means ToaST's lower token count per sequence allows more raw text to be processed within the same step budget. We will revise the experimental description to state this explicitly and discuss the contribution of increased data exposure to the reported CORE improvements.
  Revision: yes
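The size of the confound raised in this exchange can be estimated with a short calculation. The sketch below assumes every tokenizer sees the same number of tokens per optimizer step (same batch size and sequence length), which the source does not state, and the helper name is hypothetical. Under that assumption, a tokenizer that needs a fraction `r` fewer tokens for the same text covers `1 / (1 - r)` times as much raw text in a fixed number of steps.

```python
def extra_text_fraction(token_reduction: float) -> float:
    """Fractional increase in raw text seen under a fixed token budget.

    ASSUMPTION: tokens per optimizer step are identical across tokenizers.
    token_reduction is the fraction of tokens saved, e.g. 0.11 for 11%.
    """
    if not 0.0 <= token_reduction < 1.0:
        raise ValueError("token_reduction must be in [0, 1)")
    return 1.0 / (1.0 - token_reduction) - 1.0


# With the reported reduction of (at least) 11%:
print(f"{extra_text_fraction(0.11):.3f}")
```

For an 11% reduction this comes to roughly 12.4% more raw text, so a comparison at a matched amount of raw text would be needed to separate the tokenizer's effect from the extra data exposure.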
Circularity Check
No circularity in derivation: IP optimizes explicit objective with external baseline comparisons
The paper defines split trees independently via byte n-gram counts, then formulates vocabulary selection as an IP minimizing total token count under the recursive descent inference rule. Reported token-count reductions (>11% vs BPE/WordPiece/UnigramLM) are direct empirical measurements against those external methods' own vocabularies and inference procedures, not a re-reporting of the IP objective on the same data. Downstream 1.5B LM CORE scores are presented as experimental results without any reduction to a fitted parameter or self-citation chain. No uniqueness theorems, ansatzes, or renamings from prior author work are invoked as load-bearing steps. The derivation chain is self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Byte n-gram counts computed independently of any vocabulary can be used to construct greedy full binary split trees that support the subsequent recursive inference procedure.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean: washburn_uniqueness_aczel (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: Vocabulary selection is formulated as an Integer Program (IP) that minimizes the total token count over all split trees under this inference procedure.
- IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean: bare_distinguishability_of_absolute_floor (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: ToaST greedily splits each pretoken into a full binary tree using precomputed byte n-gram counts, independent of any vocabulary.
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.