Tokenisation via Convex Relaxations
Pith reviewed 2026-05-22 05:16 UTC · model grok-4.3
The pith
Tokenization can be cast as a linear program whose solution yields vocabularies within 1% of optimal.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By relaxing the discrete problem of choosing a vocabulary of fixed size into a linear program, ConvexTok finds tokenizers that improve bits-per-byte on language models and lie within 1% of the lower bound on the objective at typical vocabulary sizes; the same construction also improves most intrinsic tokenization metrics while producing mixed results on downstream tasks.
What carries the argument
ConvexTok, the solution of a linear program that relaxes vocabulary selection into a convex objective over token frequencies and coverage constraints.
If this is right
- Language models achieve lower bits-per-byte with ConvexTok vocabularies than with BPE or Unigram at equal vocabulary size.
- Intrinsic tokenization metrics such as fertility and coverage improve consistently.
- A computable lower bound certifies that the obtained tokenizer is within 1% of optimal under the linear objective.
- Downstream task accuracy gains appear but are smaller and less consistent than the efficiency gains.
Where Pith is reading between the lines
- The same linear-programming framing could be applied to other preprocessing stages such as sentence segmentation or morphological segmentation.
- If the objective can be made differentiable, the relaxation might be inserted directly into end-to-end model training.
- Multilingual settings could benefit from joint optimization over multiple languages inside one linear program.
Load-bearing premise
The linear objective and its relaxation are assumed to be a faithful proxy for what actually improves language-model performance.
What would settle it
Train identical language models on corpora tokenized by ConvexTok versus BPE at the same vocabulary size and measure whether the bits-per-byte gap disappears or reverses.
Figures
read the original abstract
Tokenisation is an integral part of the current NLP pipeline. Current tokenisation algorithms such as BPE and Unigram are greedy algorithms -- they make locally optimal decisions without considering the resulting vocabulary as a whole. We instead formulate tokeniser construction as a linear program and solve it using convex optimisation tools, yielding a new algorithm we call ConvexTok. We find ConvexTok consistently improves intrinsic tokenisation metrics and the bits-per-byte (BpB) achieved by language models; it also improves downstream task performance, but less consistently. Furthermore, ConvexTok allows the user to certify how far their tokeniser is from optimal, with respect to a certain objective, via a lower bound, and we empirically find it to be within 1\% of optimal at common vocabulary sizes.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper presents ConvexTok, a new tokenization method that formulates the construction of the vocabulary as a linear program solved using convex optimization techniques. Unlike greedy algorithms such as BPE and Unigram, it considers the vocabulary as a whole. The authors report that ConvexTok consistently improves intrinsic tokenization metrics and the bits-per-byte (BpB) of language models, with less consistent improvements on downstream tasks. Additionally, it provides a lower bound to certify the distance to optimality with respect to the LP objective, empirically showing solutions within 1% of optimal for common vocabulary sizes.
Significance. If the results hold and the LP objective is a good proxy for LM performance, this work introduces a principled, optimization-based approach to tokenization with the novel feature of optimality certificates. This could be significant for the field as it moves beyond heuristic methods. The use of convex relaxations and lower bounds is a strength that allows for verifiable claims.
major comments (2)
- Abstract: The claim that ConvexTok 'consistently improves' BpB and downstream performance, and is 'within 1% of optimal', requires more detail on the specific LP formulation, the definition of the objective function, data exclusion rules, and statistical significance tests to rule out post-hoc fitting or selection effects.
- Method: The linear programming relaxation and chosen objective function need to be validated as a faithful model of tokenizer quality for language modeling. No ablation is mentioned that compares the LP objective value to the actual negative log-likelihood of a trained transformer or holds data and model fixed while varying the tokenizer objective.
minor comments (1)
- Abstract: The abstract mentions 'a certain objective' for the optimality certificate; this should be explicitly defined early in the paper.
Simulated Author's Rebuttal
Thank you for the referee's insightful comments on our work. We respond to each major comment below and indicate the revisions we plan to make.
read point-by-point responses
-
Referee: Abstract: The claim that ConvexTok 'consistently improves' BpB and downstream performance, and is 'within 1% of optimal', requires more detail on the specific LP formulation, the definition of the objective function, data exclusion rules, and statistical significance tests to rule out post-hoc fitting or selection effects.
Authors: We concur that providing more detail in the abstract would strengthen the presentation of our results. In the revised manuscript, we will elaborate on the LP formulation and the definition of the objective function. We will also specify the data exclusion rules applied in our experiments and include statistical significance tests for the reported improvements in BpB and downstream performance to address potential concerns about selection effects. The claim of being within 1% of optimal is supported by the lower bound computation described in the paper, and we will ensure this is clearly contextualized. revision: yes
-
Referee: Method: The linear programming relaxation and chosen objective function need to be validated as a faithful model of tokenizer quality for language modeling. No ablation is mentioned that compares the LP objective value to the actual negative log-likelihood of a trained transformer or holds data and model fixed while varying the tokenizer objective.
Authors: This is a fair point regarding the validation of our objective. Our LP formulation optimizes a specific objective that we argue serves as a reasonable proxy for tokenizer quality, as demonstrated by the improvements in intrinsic metrics and BpB. However, we acknowledge that a direct ablation comparing the LP objective to the NLL of a fixed transformer model was not included. We will add a section in the revised paper discussing the motivation for the chosen objective and its empirical correlation with LM performance. A full ablation study may be included if it can be completed within the revision timeline, but we note that such experiments are computationally intensive. We disagree that this invalidates the current contributions, as the optimality certificate is a novel aspect independent of this validation. revision: partial
Circularity Check
No circularity: LP formulation and empirical gains are independent
full rationale
The derivation introduces a linear-programming relaxation of vocabulary selection as a novel modeling choice solved by standard convex-optimization tools; the resulting tokenizers are then evaluated on separate, externally measured quantities (BpB of trained language models and downstream task accuracy). No equation equates a claimed improvement or optimality gap to a quantity defined by the same fitted parameters or by a self-citation chain; the 1% optimality certificate is explicitly relative to the paper's own LP objective, while BpB and task gains are reported from independent training runs. The method is therefore self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Tokenization construction admits a linear programming formulation whose relaxation is solvable by convex optimization tools
Reference graph
Works this paper leans on
-
[2]
Schmidt, Craig W. and Reddy, Varshini and Zhang, Haoran and Alameddine, Alec and Uzan, Omri and Pinter, Yuval and Tanner, Chris. Tokenization Is More Than Compression. Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing. 2024. doi:10.18653/v1/2024.emnlp-main.40
-
[3]
Williamson, David P. and Shmoys, David B. , title =. 2011 , isbn =
work page 2011
-
[4]
L. R. Ford and D. R. Fulkerson , publisher =. Flows in Networks , urldate =
-
[5]
George B. Dantzig , journal =. On the Shortest Route Through a Network , urldate =
-
[6]
NLLB Team and Costa-juss \`a , Marta R. and Cross, James and C elebi, Onur and Elbayad, Maha and Heafield, Kenneth and Heffernan, Kevin and Kalbassi, Elahe and Lam, Janice and Licht, Daniel and Maillard, Jean and Sun, Anna and Wang, Skyler and Wenzek, Guillaume and Youngblood, Al and Akula, Bapi and Barrault, Loic and Gonzalez, Gabriel Mejia and Hansanti,...
-
[7]
The Thirty-ninth Annual Conference on Neural Information Processing Systems , year=
A Partition Cover Approach to Tokenization , author=. The Thirty-ninth Annual Conference on Neural Information Processing Systems , year=
-
[8]
The Fourteenth International Conference on Learning Representations , year=
Tokenisation over Bounded Alphabets is Hard , author=. The Fourteenth International Conference on Learning Representations , year=
- [9]
-
[10]
Karger and Debmalya Panigrahi , title =
Mohsen Ghaffari and David R. Karger and Debmalya Panigrahi , title =. Proceedings of the 2017 Annual ACM-SIAM Symposium on Discrete Algorithms (SODA) , chapter =. 2017 , publisher=. doi:10.1137/1.9781611974782.71 , URL =
- [11]
-
[12]
Meister, Clara , year =
-
[13]
Whittington, Philip and Bachmann, Gregor and Pimentel, Tiago. Tokenisation is NP -complete. Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2025. doi:10.18653/v1/2025.acl-long.1365
- [14]
-
[15]
Jeffrey Li and Alex Fang and Georgios Smyrnis and Maor Ivgi and Matt Jordan and Samir Yitzhak Gadre and Hritik Bansal and Etash Kumar Guha and Sedrick Keh and Kushal Arora and Saurabh Garg and Rui Xin and Niklas Muennighoff and Reinhard Heckel and Jean Mercat and Mayee F Chen and Suchin Gururangan and Mitchell Wortsman and Alon Albalak and Yonatan Bitton ...
work page 2024
- [16]
-
[17]
Gage, Philip , title =. C Users Journal , month =. 1994 , issue_date =
work page 1994
-
[18]
Neural Machine Translation of Rare Words with Subword Units
Sennrich, Rico and Haddow, Barry and Birch, Alexandra. Neural Machine Translation of Rare Words with Subword Units. Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2016. doi:10.18653/v1/P16-1162
-
[19]
Applied Artificial Intelligence , volume =
Sanghyun Choo and Wonjoon Kim , title =. Applied Artificial Intelligence , volume =. 2023 , publisher =. doi:10.1080/08839514.2023.2175112 , URL =
-
[20]
A Formal Perspective on Byte-Pair Encoding
Zouhar, Vil \'e m and Meister, Clara and Gastaldi, Juan and Du, Li and Vieira, Tim and Sachan, Mrinmaya and Cotterell, Ryan. A Formal Perspective on Byte-Pair Encoding. Findings of the Association for Computational Linguistics: ACL 2023. 2023. doi:10.18653/v1/2023.findings-acl.38
-
[21]
Tokenization and the Noiseless Channel
Zouhar, Vil \'e m and Meister, Clara and Gastaldi, Juan and Du, Li and Sachan, Mrinmaya and Cotterell, Ryan. Tokenization and the Noiseless Channel. Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2023. doi:10.18653/v1/2023.acl-long.284
-
[22]
Investigating the Effectiveness of BPE : The Power of Shorter Sequences
Gall \'e , Matthias. Investigating the Effectiveness of BPE : The Power of Shorter Sequences. Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). 2019. doi:10.18653/v1/D19-1141
-
[23]
Finding the Optimal Vocabulary Size for Neural Machine Translation
Gowda, Thamme and May, Jonathan. Finding the Optimal Vocabulary Size for Neural Machine Translation. Findings of the Association for Computational Linguistics: EMNLP 2020. 2020. doi:10.18653/v1/2020.findings-emnlp.352
-
[24]
Two Counterexamples to Tokenization and the Noiseless Channel
Cognetta, Marco and Zouhar, Vil \'e m and Moon, Sangwhan and Okazaki, Naoaki. Two Counterexamples to Tokenization and the Noiseless Channel. Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024). 2024
work page 2024
-
[25]
Subword Regularization: Improving Neural Network Translation Models with Multiple Subword Candidates
Kudo, Taku. Subword Regularization: Improving Neural Network Translation Models with Multiple Subword Candidates. Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2018. doi:10.18653/v1/P18-1007
-
[26]
The Twelfth International Conference on Learning Representations , year=
Language Modeling Is Compression , author=. The Twelfth International Conference on Learning Representations , year=
-
[27]
Tokenizer Choice For LLM Training: Negligible or Crucial?
Ali, Mehdi and Fromm, Michael and Thellmann, Klaudia and Rutmann, Richard and L. Tokenizer Choice For LLM Training: Negligible or Crucial?. Findings of the Association for Computational Linguistics: NAACL 2024. 2024. doi:10.18653/v1/2024.findings-naacl.247
-
[28]
Unpacking Tokenization: Evaluating Text Compression and its Correlation with Model Performance
Goldman, Omer and Caciularu, Avi and Eyal, Matan and Cao, Kris and Szpektor, Idan and Tsarfaty, Reut. Unpacking Tokenization: Evaluating Text Compression and its Correlation with Model Performance. Findings of the Association for Computational Linguistics: ACL 2024. 2024. doi:10.18653/v1/2024.findings-acl.134
-
[29]
OpenAI , year=. 2303.08774 , archivePrefix=
work page internal anchor Pith review Pith/arXiv arXiv
-
[30]
Hugo Touvron and Thibaut Lavril and Gautier Izacard and Xavier Martinet and Marie-Anne Lachaux and Timothée Lacroix and Baptiste Rozière and Naman Goyal and Eric Hambro and Faisal Azhar and Aurelien Rodriguez and Armand Joulin and Edouard Grave and Guillaume Lample , url=
-
[31]
Biderman, Stella and Schoelkopf, Hailey and Anthony, Quentin and Bradley, Herbie and O'Brien, Kyle and Hallahan, Eric and Khan, Mohammad Aflah and Purohit, Shivanshu and Prashanth, USVSN Sai and Raff, Edward and Skowron, Aviya and Sutawika, Lintang and Van Der Wal, Oskar , booktitle =. Pythia:
-
[32]
Second Conference on Language Modeling , year=
Boundless Byte Pair Encoding: Breaking the Pre-tokenization Barrier , author=. Second Conference on Language Modeling , year=
-
[33]
Smith and Yejin Choi , booktitle=
Alisa Liu and Jonathan Hayase and Valentin Hofmann and Sewoong Oh and Noah A. Smith and Yejin Choi , booktitle=. Super. 2025 , url=
work page 2025
-
[34]
Dijkstra, E. W. , title =. Numerische Mathematik , year =
-
[35]
Reducibility among combinatorial problems
Karp, Richard M. Reducibility among Combinatorial Problems. Complexity of Computer Computations: Proceedings of a symposium on the Complexity of Computer Computations, held March 20--22, 1972, at the IBM Thomas J. Watson Research Center, Yorktown Heights, New York, and sponsored by the Office of Naval Research, Mathematics Program, IBM World Trade Corpora...
-
[36]
Randomized Algorithms , publisher=
Motwani, Rajeev and Raghavan, Prabhakar , year=. Randomized Algorithms , publisher=
-
[37]
Shmoys, David B. and Tardos,. An approximation algorithm for the generalized assignment problem , url =. Mathematical Programming , number =. 1993 , bdsk-url-1 =. doi:10.1007/BF01585178 , id =
- [38]
-
[39]
and Steiglitz, Kenneth , title =
Papadimitriou, Christos H. and Steiglitz, Kenneth , title =. 1982 , isbn =
work page 1982
-
[40]
Proceedings of the twentieth annual ACM symposium on Theory of computing , year=
Expressing combinatorial optimization problems by linear programs , author=. Proceedings of the twentieth annual ACM symposium on Theory of computing , year=
-
[41]
Extension Complexity of Independent Set Polytopes , journal =
G\". Extension Complexity of Independent Set Polytopes , journal =. 2018 , doi =. https://doi.org/10.1137/16M109884X , abstract =
-
[42]
Practical large-scale linear programming using primal-dual hybrid gradient , year =
Applegate, David and D\'. Practical large-scale linear programming using primal-dual hybrid gradient , year =. Proceedings of the 35th International Conference on Neural Information Processing Systems , articleno =
- [43]
-
[44]
Scaling Laws for Neural Language Models
Kaplan, Jared and McCandlish, Sam and Henighan, Tom and Brown, Tom B. and Chess, Benjamin and Child, Rewon and Gray, Scott and Radford, Alec and Wu, Jeffrey and Amodei, Dario. Scaling Laws for Neural Language Models. 2020. arXiv:2001.08361
work page internal anchor Pith review Pith/arXiv arXiv 2020
-
[45]
Hoffmann, Jordan and Borgeaud, Sebastian and Mensch, Arthur and Buchatskaya, Elena and Cai, Trevor and Rutherford, Eliza and de Las Casas, Diego and Hendricks, Lisa Anne and Welbl, Johannes and Clark, Aidan and Hennigan, Tom and Noland, Eric and Millican, Katie and van den Driessche, George and Damoc, Bogdan and Guy, Aurelia and Osindero, Simon and Simony...
work page 2022
-
[46]
David L. Applegate and Robert E. Bixby and Vašek Chvatál and William J. Cook , publisher =. The Traveling Salesman Problem: A Computational Study , urldate =
-
[47]
BPE Gets Picky: Efficient Vocabulary Refinement During Tokenizer Training
Chizhov, Pavel and Arnett, Catherine and Korotkova, Elizaveta and Yamshchikov, Ivan P. BPE Gets Picky: Efficient Vocabulary Refinement During Tokenizer Training. Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing. 2024. doi:10.18653/v1/2024.emnlp-main.925
-
[48]
Hoffman, Alan J. and Kruskal, Joseph B. Integral Boundary Points of Convex Polyhedra. 50 Years of Integer Programming 1958-2008: From the Early Years to the State-of-the-Art. 2010. doi:10.1007/978-3-540-68279-0_3
-
[49]
Lian, Haoran and Xiong, Yizhe and Niu, Jianwei and Mo, Shasha and Su, Zhenpeng and Lin, Zijia and Chen, Hui and Han, Jungong and Ding, Guiguang , volume=. Scaffold-. Proceedings of the AAAI Conference on Artificial Intelligence , year=. doi:10.1609/aaai.v39i23.34633 , number=
-
[50]
Proceedings of the Fifth Workshop on Insights from Negative Results in NLP , doi =
Cognetta, Marco and Hiraoka, Tatsuya and Sennrich, Rico and Pinter, Yuval and Okazaki, Naoaki , title =. Proceedings of the Fifth Workshop on Insights from Negative Results in NLP , doi =. 2024 , month = jun, pages =
work page 2024
-
[51]
Paths and cycles in colored graphs
Hajo Broersma and Xueliang Li and Gerhard Woeginger and Shenggui Zhang. Paths and cycles in colored graphs. Australasian journal of combinatorics. 2005
work page 2005
-
[52]
Journal of Combinatorial Optimization , year=
Approximation algorithms and hardness results for labeled connectivity problems , author=. Journal of Combinatorial Optimization , year=
-
[53]
Approximation and hardness results for label cut and related problems , volume =
Zhang, Peng and Cai, Jin-Yi and Tang, Lin-Qing and Zhao, Wen-Bo , year =. Approximation and hardness results for label cut and related problems , volume =. Journal of Combinatorial Optimization , doi =
- [54]
-
[55]
Forsythe, Alasdair , year =
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.