Tokenisation via Convex Relaxations

Craig W. Schmidt; Dennis Komm; Jan Tempus; Philip Whittington; Tiago Pimentel

arxiv: 2605.22821 · v1 · pith:A3HSD7RNnew · submitted 2026-05-21 · 💻 cs.CL · cs.LG

Pith reviewed 2026-05-22 05:16 UTC · model grok-4.3

classification: cs.CL · cs.LG

keywords: tokenization · convex optimization · linear programming · language modeling · BPE · vocabulary selection · bits-per-byte

The pith

Tokenization can be cast as a linear program whose solution yields vocabularies within 1% of optimal.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper replaces greedy algorithms such as BPE and Unigram with a formulation of vocabulary selection as a linear program that is solved by convex optimization. The resulting method, ConvexTok, produces tokenizers that improve intrinsic metrics and the bits-per-byte achieved by language models. It also supplies a lower bound that certifies how close any given tokenizer is to the optimum under the chosen objective. A reader would care because the approach turns an opaque design choice into a verifiable optimization problem that can be tightened or relaxed as needed.

Core claim

By relaxing the discrete problem of choosing a vocabulary of fixed size into a linear program, ConvexTok finds tokenizers that improve bits-per-byte on language models and lie within 1% of the lower bound on the objective at typical vocabulary sizes; the same construction also improves most intrinsic tokenization metrics while producing mixed results on downstream tasks.

What carries the argument

ConvexTok, the solution of a linear program that relaxes vocabulary selection into a convex objective over token frequencies and coverage constraints.

If this is right

Language models achieve lower bits-per-byte with ConvexTok vocabularies than with BPE or Unigram at equal vocabulary size.
Intrinsic tokenization metrics (the figures report compression, vocabulary utilisation, type-token ratio, and entropy, among others) improve consistently.
A computable lower bound certifies that the obtained tokenizer is within 1% of optimal under the linear objective.
Downstream task performance also improves, but less consistently than the intrinsic metrics and bits-per-byte.
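
The certificate in the list above can be read as a relative gap between the objective value a tokenizer achieves and the LP lower bound. The following is a minimal sketch, assuming a minimisation objective such as total token count; the function name and the numbers are illustrative and do not come from the paper.

```python
def relative_gap(achieved: float, lower_bound: float) -> float:
    """Relative distance of an achieved objective from a certified lower bound.

    Assumes a minimisation objective, so achieved >= lower_bound > 0.
    """
    if lower_bound <= 0:
        raise ValueError("lower bound must be positive")
    if achieved < lower_bound:
        raise ValueError("achieved value cannot be below a valid lower bound")
    return (achieved - lower_bound) / lower_bound


# Illustrative numbers only: 1,005,000 tokens used vs. a bound of 1,000,000.
gap = relative_gap(1_005_000, 1_000_000)
within_one_percent = gap <= 0.01
```

Because any feasible tokenizer can be plugged in as `achieved`, the same check would apply to a BPE or Unigram vocabulary once the bound for the corpus is known.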

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same linear-programming framing could be applied to other preprocessing stages such as sentence segmentation or morphological segmentation.
If the objective can be made differentiable, the relaxation might be inserted directly into end-to-end model training.
Multilingual settings could benefit from joint optimization over multiple languages inside one linear program.

Load-bearing premise

The linear objective and its relaxation are assumed to be a faithful proxy for what actually improves language-model performance.

What would settle it

Train identical language models on corpora tokenized by ConvexTok versus BPE at the same vocabulary size and measure whether the bits-per-byte gap disappears or reverses.
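
Bits-per-byte normalises a model's loss by the byte length of the text rather than by its token count, which is what makes it comparable across tokenizers. A minimal sketch of the standard conversion follows, assuming the summed negative log-likelihood is given in nats; the names and the example values are illustrative.

```python
import math


def bits_per_byte(total_nll_nats: float, text: str) -> float:
    """Convert a summed negative log-likelihood (in nats) to bits per UTF-8 byte."""
    n_bytes = len(text.encode("utf-8"))
    if n_bytes == 0:
        raise ValueError("text must be non-empty")
    return total_nll_nats / (math.log(2) * n_bytes)


# Two hypothetical models over the same text: their token counts may differ,
# but the denominator (bytes) is identical, so the values are comparable.
text = "abaa aba"  # 8 bytes
bpb_a = bits_per_byte(total_nll_nats=8 * math.log(2), text=text)
bpb_b = bits_per_byte(total_nll_nats=6 * math.log(2), text=text)
```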

Figures

Figures reproduced from arXiv: 2605.22821 by Craig W. Schmidt, Dennis Komm, Jan Tempus, Philip Whittington, Tiago Pimentel.

**Figure 1.** Tokenisation graph constructed from D = {abaa, aba}. Black edges represent E_byte and others represent E_tok. Our paper's main contribution is an LP-based tokenisation algorithm which directly approximates a globally optimal tokeniser, thus avoiding locally optimal solutions. As our tokeniser relies on a convex relaxation, we term it ConvexTok. Before we get to our LP, however, we first express the problem …
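
One plausible reading of the tokenisation graph in Figure 1 is that nodes are positions in a string, single-symbol edges are always available, and each vocabulary token adds a longer edge; segmenting a string with the fewest tokens is then a shortest-path problem. The sketch below works under that assumption. The per-string graph, the function name, and the example vocabulary are hypothetical, and the paper's exact construction may differ.

```python
def min_tokens(text: str, vocab: set[str]) -> list[str]:
    """Shortest path from position 0 to len(text) in a per-string tokenisation graph.

    Nodes are character positions. Single-character edges are always available
    (standing in for the byte edges); an edge i -> j exists for every longer
    substring text[i:j] that is in `vocab` (standing in for the token edges).
    """
    n = len(text)
    best = [0] + [n + 1] * n      # best[j]: fewest edges needed to reach position j
    back = [0] * (n + 1)          # back[j]: predecessor position on a best path
    max_len = max((len(t) for t in vocab), default=1)
    for j in range(1, n + 1):
        for i in range(max(0, j - max_len), j):
            piece = text[i:j]
            if (j - i == 1 or piece in vocab) and best[i] + 1 < best[j]:
                best[j], back[j] = best[i] + 1, i
    # Walk the predecessor pointers back to recover the segmentation.
    pieces, j = [], n
    while j > 0:
        pieces.append(text[back[j]:j])
        j = back[j]
    return pieces[::-1]


corpus = ["abaa", "aba"]          # the dataset D from Figure 1
vocab = {"ab", "aba"}             # hypothetical token set, chosen for illustration
segmentations = {s: min_tokens(s, vocab) for s in corpus}
total_tokens = sum(len(p) for p in segmentations.values())
```

This only evaluates a fixed vocabulary; choosing which token edges to allow under a size budget is the harder selection problem that the LP relaxation addresses.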

**Figure 2.** The Det (left), Bias (center), and Int (right) rounding schemes. … of the token it represents (c/length(c), with a slight abuse of notation to let length(c) denote the length of the corresponding token), and then rounds these values (to 1 or 0) as before. This biases the selection towards shorter tokens when LP scores are comparable, and is motivated by the fact that shorter tokens are more likely to occur o…
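
The length-biased rounding described in the Figure 2 caption divides each LP value c by the length of its token before rounding to 0 or 1. The sketch below illustrates that idea only. It assumes that rounding means keeping the k largest values, which the truncated caption does not confirm, and it does not attempt the Int scheme; the function name and the LP values are hypothetical.

```python
def round_top_k(scores: dict[str, float], k: int, length_bias: bool = False) -> set[str]:
    """Round fractional LP values to a 0/1 selection by keeping the k largest.

    ASSUMPTION: top-k selection stands in for the paper's rounding step, whose
    exact rule is not given in the caption. With length_bias=True each value c
    is first divided by the token length, as the caption describes.
    Length is measured in characters here for simplicity.
    """
    def key(token: str) -> float:
        c = scores[token]
        return c / len(token) if length_bias else c

    # Sort by descending key; break ties by token string for determinism.
    ranked = sorted(scores, key=lambda t: (-key(t), t))
    return set(ranked[:k])


# Hypothetical fractional LP values for three candidate tokens.
lp_values = {"ab": 0.6, "aba": 0.7, "abaa": 0.65}
plain = round_top_k(lp_values, k=2)
biased = round_top_k(lp_values, k=2, length_bias=True)
```

With these values the unbiased selection keeps the two highest-scoring tokens, while the biased variant swaps the longest token for the shortest one, matching the caption's remark that the bias favours shorter tokens when LP scores are comparable.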

**Figure 3.** Average Jaccard similarity between vocabularies when retraining a tokeniser on indepen…
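
Jaccard similarity, the stability measure named in Figure 3, is the size of the intersection of two vocabularies divided by the size of their union. A minimal sketch follows; averaging over all unordered pairs of runs is an assumption, since the truncated caption does not show how the paper averages, and the example vocabularies are invented.

```python
from itertools import combinations


def jaccard(a: set[str], b: set[str]) -> float:
    """Jaccard similarity |a & b| / |a | b|; defined as 1.0 for two empty sets."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def average_pairwise_jaccard(vocabs: list[set[str]]) -> float:
    """Mean Jaccard similarity over all unordered pairs of vocabularies."""
    pairs = list(combinations(vocabs, 2))
    if not pairs:
        raise ValueError("need at least two vocabularies")
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


# Hypothetical vocabularies from three retraining runs.
runs = [{"a", "b", "ab"}, {"a", "b", "aba"}, {"a", "b", "ab"}]
stability = average_pairwise_jaccard(runs)
```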

**Figure 4.** Compression by the different tokenisers. (Left) absolute and (Right) relative to …

**Figure 5.** Vocabulary utilisation by different tokenisers. (Left) absolute and (Right) relative to …

**Figure 6.** Type-token ratio by different tokenisers. (Left) absolute and (Right) relative to …

**Figure 7.** Token length of different tokenisers. (Left) absolute and (Right) relative to …

**Figure 8.** Tokens per line of different tokenisers. (Left) absolute and (Right) relative to …

**Figure 9.** Shannon entropy of different tokenisers. (Left) absolute and (Right) relative to …

**Figure 10.** Rényi entropy of different tokenisers. (Left) absolute and (Right) relative to …

**Figure 11.** Average rank of different tokenisers. (Left) absolute and (Right) relative to …

**Figure 12.** Compression by the different tokenisers. (Left) absolute and (Right) relative to …

**Figure 13.** Vocabulary utilisation by different tokenisers. (Left) absolute and (Right) relative to …

**Figure 14.** Type-token ratio by different tokenisers. (Left) absolute and (Right) relative to …

**Figure 15.** Token length of different tokenisers. (Left) absolute and (Right) relative to …

**Figure 16.** Tokens per line of different tokenisers. (Left) absolute and (Right) relative to …

**Figure 17.** Shannon entropy of different tokenisers. (Left) absolute and (Right) relative to …

**Figure 18.** Rényi entropy of different tokenisers. (Left) absolute and (Right) relative to …

**Figure 19.** Average rank of different tokenisers. (Left) absolute and (Right) relative to …

**Figure 20.** Compression by the different tokenisers. (Left) absolute and (Right) relative to …

**Figure 21.** Vocabulary utilisation by different tokenisers. (Left) absolute and (Right) relative to …

**Figure 22.** Type-token ratio by different tokenisers. (Left) absolute and (Right) relative to …

**Figure 23.** Token length of different tokenisers. (Left) absolute and (Right) relative to …

**Figure 24.** Tokens per line of different tokenisers. (Left) absolute and (Right) relative to …

**Figure 25.** Shannon entropy of different tokenisers. (Left) absolute and (Right) relative to …

**Figure 26.** Rényi entropy of different tokenisers. (Left) absolute and (Right) relative to …

**Figure 27.** Average rank of different tokenisers. (Left) absolute and (Right) relative to …

**Figure 33.** (left) BpB and (right) CORE vs. vocabulary size across models with different depths

**Figure 34.** (left) BpB and (right) CORE vs. vocabulary size across three training seeds. All these models were trained with 12 layers.

Original abstract

Tokenisation is an integral part of the current NLP pipeline. Current tokenisation algorithms such as BPE and Unigram are greedy algorithms: they make locally optimal decisions without considering the resulting vocabulary as a whole. We instead formulate tokeniser construction as a linear program and solve it using convex optimisation tools, yielding a new algorithm we call ConvexTok. We find ConvexTok consistently improves intrinsic tokenisation metrics and the bits-per-byte (BpB) achieved by language models; it also improves downstream task performance, but less consistently. Furthermore, ConvexTok allows the user to certify how far their tokeniser is from optimal, with respect to a certain objective, via a lower bound, and we empirically find it to be within 1% of optimal at common vocabulary sizes.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

ConvexTok reframes tokenization as a solvable LP for a global optimum and bound, but the chosen linear objective still needs direct checks against actual LM training loss.

The paper's core move is to drop the greedy local choices in BPE and Unigram and instead cast vocabulary selection as a linear program that can be solved with standard convex tools. That produces ConvexTok plus a lower bound that lets them claim the output sits within 1% of optimal for their objective at normal vocabulary sizes. They report consistent lifts on intrinsic tokenization scores and bits-per-byte when the resulting tokenizer is plugged into language model training, with patchier gains on downstream tasks.

Referee Report

2 major / 1 minor

Summary. The paper presents ConvexTok, a new tokenization method that formulates the construction of the vocabulary as a linear program solved using convex optimization techniques. Unlike greedy algorithms such as BPE and Unigram, it considers the vocabulary as a whole. The authors report that ConvexTok consistently improves intrinsic tokenization metrics and the bits-per-byte (BpB) of language models, with less consistent improvements on downstream tasks. Additionally, it provides a lower bound to certify the distance to optimality with respect to the LP objective, empirically showing solutions within 1% of optimal for common vocabulary sizes.

Significance. If the results hold and the LP objective is a good proxy for LM performance, this work introduces a principled, optimization-based approach to tokenization with the novel feature of optimality certificates. This could be significant for the field as it moves beyond heuristic methods. The use of convex relaxations and lower bounds is a strength that allows for verifiable claims.

major comments (2)

Abstract: The claim that ConvexTok 'consistently improves' BpB and downstream performance, and is 'within 1% of optimal', requires more detail on the specific LP formulation, the definition of the objective function, data exclusion rules, and statistical significance tests to rule out post-hoc fitting or selection effects.
Method: The linear programming relaxation and chosen objective function need to be validated as a faithful model of tokenizer quality for language modeling. No ablation is mentioned that compares the LP objective value to the actual negative log-likelihood of a trained transformer or holds data and model fixed while varying the tokenizer objective.

minor comments (1)

Abstract: The abstract mentions 'a certain objective' for the optimality certificate; this should be explicitly defined early in the paper.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the insightful comments on our work. We respond to each major comment below and indicate the revisions we plan to make.

Referee: Abstract: The claim that ConvexTok 'consistently improves' BpB and downstream performance, and is 'within 1% of optimal', requires more detail on the specific LP formulation, the definition of the objective function, data exclusion rules, and statistical significance tests to rule out post-hoc fitting or selection effects.

Authors: We concur that providing more detail in the abstract would strengthen the presentation of our results. In the revised manuscript, we will elaborate on the LP formulation and the definition of the objective function. We will also specify the data exclusion rules applied in our experiments and include statistical significance tests for the reported improvements in BpB and downstream performance to address potential concerns about selection effects. The claim of being within 1% of optimal is supported by the lower bound computation described in the paper, and we will ensure this is clearly contextualized.

Revision: yes

Referee: Method: The linear programming relaxation and chosen objective function need to be validated as a faithful model of tokenizer quality for language modeling. No ablation is mentioned that compares the LP objective value to the actual negative log-likelihood of a trained transformer or holds data and model fixed while varying the tokenizer objective.

Authors: This is a fair point regarding the validation of our objective. Our LP formulation optimizes a specific objective that we argue serves as a reasonable proxy for tokenizer quality, as demonstrated by the improvements in intrinsic metrics and BpB. However, we acknowledge that a direct ablation comparing the LP objective to the NLL of a fixed transformer model was not included. We will add a section in the revised paper discussing the motivation for the chosen objective and its empirical correlation with LM performance. A full ablation study may be included if it can be completed within the revision timeline, but we note that such experiments are computationally intensive. We disagree that this invalidates the current contributions, as the optimality certificate is a novel aspect independent of this validation.

Revision: partial

Circularity Check

0 steps flagged

No circularity: LP formulation and empirical gains are independent

The derivation introduces a linear-programming relaxation of vocabulary selection as a novel modeling choice solved by standard convex-optimization tools; the resulting tokenizers are then evaluated on separate, externally measured quantities (BpB of trained language models and downstream task accuracy). No equation equates a claimed improvement or optimality gap to a quantity defined by the same fitted parameters or by a self-citation chain; the 1% optimality certificate is explicitly relative to the paper's own LP objective, while BpB and task gains are reported from independent training runs. The method is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that tokenization quality can be usefully expressed as the objective of a linear program whose relaxation yields practical solutions. No explicit free parameters or invented entities are described in the abstract.

axioms (1)

Domain assumption: Tokenization construction admits a linear programming formulation whose relaxation is solvable by convex optimization tools.
This premise is required for the ConvexTok algorithm to exist and is invoked when the abstract contrasts the new method with greedy algorithms.

discussion (0)

Reference graph

Works this paper leans on

54 extracted references · 54 canonical work pages · 2 internal anchors

[2]

and Reddy, Varshini and Zhang, Haoran and Alameddine, Alec and Uzan, Omri and Pinter, Yuval and Tanner, Chris

Schmidt, Craig W. and Reddy, Varshini and Zhang, Haoran and Alameddine, Alec and Uzan, Omri and Pinter, Yuval and Tanner, Chris. Tokenization Is More Than Compression. Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing. 2024. doi:10.18653/v1/2024.emnlp-main.40

work page doi:10.18653/v1/2024.emnlp-main.40 2024
[3]

and Shmoys, David B

Williamson, David P. and Shmoys, David B. , title =. 2011 , isbn =

work page 2011
[4]

L. R. Ford and D. R. Fulkerson , publisher =. Flows in Networks , urldate =

work page
[5]

Dantzig , journal =

George B. Dantzig , journal =. On the Shortest Route Through a Network , urldate =

work page
[6]

NLLB Team and Costa-juss \`a , Marta R. and Cross, James and C elebi, Onur and Elbayad, Maha and Heafield, Kenneth and Heffernan, Kevin and Kalbassi, Elahe and Lam, Janice and Licht, Daniel and Maillard, Jean and Sun, Anna and Wang, Skyler and Wenzek, Guillaume and Youngblood, Al and Akula, Bapi and Barrault, Loic and Gonzalez, Gabriel Mejia and Hansanti,...

work page doi:10.1038/s41586-024-07335-x 2024
[7]

The Thirty-ninth Annual Conference on Neural Information Processing Systems , year=

A Partition Cover Approach to Tokenization , author=. The Thirty-ninth Annual Conference on Neural Information Processing Systems , year=

work page
[8]

The Fourteenth International Conference on Learning Representations , year=

Tokenisation over Bounded Alphabets is Hard , author=. The Fourteenth International Conference on Learning Representations , year=

work page
[9]

2024 , eprint=

Theoretical Analysis of Byte-Pair Encoding , author=. 2024 , eprint=

work page 2024
[10]

Karger and Debmalya Panigrahi , title =

Mohsen Ghaffari and David R. Karger and Debmalya Panigrahi , title =. Proceedings of the 2017 Annual ACM-SIAM Symposium on Discrete Algorithms (SODA) , chapter =. 2017 , publisher=. doi:10.1137/1.9781611974782.71 , URL =

work page doi:10.1137/1.9781611974782.71 2017
[11]

2026 , eprint=

Olmo 3 , author=. 2026 , eprint=

work page 2026
[12]

Meister, Clara , year =

work page
[13]

Tokenisation is NP -complete

Whittington, Philip and Bachmann, Gregor and Pimentel, Tiago. Tokenisation is NP -complete. Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2025. doi:10.18653/v1/2025.acl-long.1365

work page doi:10.18653/v1/2025.acl-long.1365 2025
[14]

2025 , publisher =

Andrej Karpathy , title =. 2025 , publisher =

work page 2025
[15]

Jeffrey Li and Alex Fang and Georgios Smyrnis and Maor Ivgi and Matt Jordan and Samir Yitzhak Gadre and Hritik Bansal and Etash Kumar Guha and Sedrick Keh and Kushal Arora and Saurabh Garg and Rui Xin and Niklas Muennighoff and Reinhard Heckel and Jean Mercat and Mayee F Chen and Suchin Gururangan and Mitchell Wortsman and Alon Albalak and Yonatan Bitton ...

work page 2024
[16]

, title =

Klotz, Ed and Newman, Alexandra M. , title =. Surveys in Operations Research and Management Science , volume =. 2013 , doi =

work page 2013
[17]

C Users Journal , month =

Gage, Philip , title =. C Users Journal , month =. 1994 , issue_date =

work page 1994
[18]

Neural Machine Translation of Rare Words with Subword Units

Sennrich, Rico and Haddow, Barry and Birch, Alexandra. Neural Machine Translation of Rare Words with Subword Units. Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2016. doi:10.18653/v1/P16-1162

work page doi:10.18653/v1/p16-1162 2016
[19]

Applied Artificial Intelligence , volume =

Sanghyun Choo and Wonjoon Kim , title =. Applied Artificial Intelligence , volume =. 2023 , publisher =. doi:10.1080/08839514.2023.2175112 , URL =

work page doi:10.1080/08839514.2023.2175112 2023
[20]

A Formal Perspective on Byte-Pair Encoding

Zouhar, Vil \'e m and Meister, Clara and Gastaldi, Juan and Du, Li and Vieira, Tim and Sachan, Mrinmaya and Cotterell, Ryan. A Formal Perspective on Byte-Pair Encoding. Findings of the Association for Computational Linguistics: ACL 2023. 2023. doi:10.18653/v1/2023.findings-acl.38

work page doi:10.18653/v1/2023.findings-acl.38 2023
[21]

Tokenization and the Noiseless Channel

Zouhar, Vil \'e m and Meister, Clara and Gastaldi, Juan and Du, Li and Sachan, Mrinmaya and Cotterell, Ryan. Tokenization and the Noiseless Channel. Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2023. doi:10.18653/v1/2023.acl-long.284

work page doi:10.18653/v1/2023.acl-long.284 2023
[22]

Investigating the Effectiveness of BPE : The Power of Shorter Sequences

Gall \'e , Matthias. Investigating the Effectiveness of BPE : The Power of Shorter Sequences. Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). 2019. doi:10.18653/v1/D19-1141

work page doi:10.18653/v1/d19-1141 2019
[23]

Finding the Optimal Vocabulary Size for Neural Machine Translation

Gowda, Thamme and May, Jonathan. Finding the Optimal Vocabulary Size for Neural Machine Translation. Findings of the Association for Computational Linguistics: EMNLP 2020. 2020. doi:10.18653/v1/2020.findings-emnlp.352

work page doi:10.18653/v1/2020.findings-emnlp.352 2020
[24]

Two Counterexamples to Tokenization and the Noiseless Channel

Cognetta, Marco and Zouhar, Vil \'e m and Moon, Sangwhan and Okazaki, Naoaki. Two Counterexamples to Tokenization and the Noiseless Channel. Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024). 2024

work page 2024
[25]

Subword Regularization: Improving Neural Network Translation Models with Multiple Subword Candidates

Kudo, Taku. Subword Regularization: Improving Neural Network Translation Models with Multiple Subword Candidates. Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2018. doi:10.18653/v1/P18-1007

work page doi:10.18653/v1/p18-1007 2018
[26]

The Twelfth International Conference on Learning Representations , year=

Language Modeling Is Compression , author=. The Twelfth International Conference on Learning Representations , year=

work page
[27]

Tokenizer Choice For LLM Training: Negligible or Crucial?

Ali, Mehdi and Fromm, Michael and Thellmann, Klaudia and Rutmann, Richard and L. Tokenizer Choice For LLM Training: Negligible or Crucial?. Findings of the Association for Computational Linguistics: NAACL 2024. 2024. doi:10.18653/v1/2024.findings-naacl.247

work page doi:10.18653/v1/2024.findings-naacl.247 2024
[28]

Unpacking Tokenization: Evaluating Text Compression and its Correlation with Model Performance

Goldman, Omer and Caciularu, Avi and Eyal, Matan and Cao, Kris and Szpektor, Idan and Tsarfaty, Reut. Unpacking Tokenization: Evaluating Text Compression and its Correlation with Model Performance. Findings of the Association for Computational Linguistics: ACL 2024. 2024. doi:10.18653/v1/2024.findings-acl.134

work page doi:10.18653/v1/2024.findings-acl.134 2024
[29]

GPT-4 Technical Report

OpenAI , year=. 2303.08774 , archivePrefix=

work page internal anchor Pith review Pith/arXiv arXiv
[30]

Hugo Touvron and Thibaut Lavril and Gautier Izacard and Xavier Martinet and Marie-Anne Lachaux and Timothée Lacroix and Baptiste Rozière and Naman Goyal and Eric Hambro and Faisal Azhar and Aurelien Rodriguez and Armand Joulin and Edouard Grave and Guillaume Lample , url=

work page
[31]

Biderman, Stella and Schoelkopf, Hailey and Anthony, Quentin and Bradley, Herbie and O'Brien, Kyle and Hallahan, Eric and Khan, Mohammad Aflah and Purohit, Shivanshu and Prashanth, USVSN Sai and Raff, Edward and Skowron, Aviya and Sutawika, Lintang and Van Der Wal, Oskar , booktitle =. Pythia:

work page
[32]

Second Conference on Language Modeling , year=

Boundless Byte Pair Encoding: Breaking the Pre-tokenization Barrier , author=. Second Conference on Language Modeling , year=

work page
[33]

Smith and Yejin Choi , booktitle=

Alisa Liu and Jonathan Hayase and Valentin Hofmann and Sewoong Oh and Noah A. Smith and Yejin Choi , booktitle=. Super. 2025 , url=

work page 2025
[34]

Dijkstra, E. W. , title =. Numerische Mathematik , year =

work page
[35]

Reducibility among combinatorial problems

Karp, Richard M. Reducibility among Combinatorial Problems. Complexity of Computer Computations: Proceedings of a symposium on the Complexity of Computer Computations, held March 20--22, 1972, at the IBM Thomas J. Watson Research Center, Yorktown Heights, New York, and sponsored by the Office of Naval Research, Mathematics Program, IBM World Trade Corpora...

work page doi:10.1007/978-1-4684-2001-2_9 1972
[36]

Randomized Algorithms , publisher=

Motwani, Rajeev and Raghavan, Prabhakar , year=. Randomized Algorithms , publisher=

work page
[37]

and Tardos,

Shmoys, David B. and Tardos,. An approximation algorithm for the generalized assignment problem , url =. Mathematical Programming , number =. 1993 , bdsk-url-1 =. doi:10.1007/BF01585178 , id =

work page doi:10.1007/bf01585178 1993
[38]

, title =

Vazirani, Vijay V. , title =. 2010 , url=

work page 2010
[39]

and Steiglitz, Kenneth , title =

Papadimitriou, Christos H. and Steiglitz, Kenneth , title =. 1982 , isbn =

work page 1982
[40]

Proceedings of the twentieth annual ACM symposium on Theory of computing , year=

Expressing combinatorial optimization problems by linear programs , author=. Proceedings of the twentieth annual ACM symposium on Theory of computing , year=

work page
[41]

Extension Complexity of Independent Set Polytopes , journal =

G\". Extension Complexity of Independent Set Polytopes , journal =. 2018 , doi =. https://doi.org/10.1137/16M109884X , abstract =

work page doi:10.1137/16m109884x 2018
[42]

Practical large-scale linear programming using primal-dual hybrid gradient , year =

Applegate, David and D\'. Practical large-scale linear programming using primal-dual hybrid gradient , year =. Proceedings of the 35th International Conference on Neural Information Processing Systems , articleno =

work page
[43]

2025 , url =

NVIDIA cuOpt User Guide: LP/QP/MILP Settings , author =. 2025 , url =

work page 2025
[44]

Scaling Laws for Neural Language Models

Kaplan, Jared and McCandlish, Sam and Henighan, Tom and Brown, Tom B. and Chess, Benjamin and Child, Rewon and Gray, Scott and Radford, Alec and Wu, Jeffrey and Amodei, Dario. Scaling Laws for Neural Language Models. 2020. arXiv:2001.08361

work page internal anchor Pith review Pith/arXiv arXiv 2020
[45]

and Sifre, Laurent , title =

Hoffmann, Jordan and Borgeaud, Sebastian and Mensch, Arthur and Buchatskaya, Elena and Cai, Trevor and Rutherford, Eliza and de Las Casas, Diego and Hendricks, Lisa Anne and Welbl, Johannes and Clark, Aidan and Hennigan, Tom and Noland, Eric and Millican, Katie and van den Driessche, George and Damoc, Bogdan and Guy, Aurelia and Osindero, Simon and Simony...

work page 2022
[46]

Applegate and Robert E

David L. Applegate and Robert E. Bixby and Vašek Chvatál and William J. Cook , publisher =. The Traveling Salesman Problem: A Computational Study , urldate =

work page
[47]

BPE Gets Picky: Efficient Vocabulary Refinement During Tokenizer Training

Chizhov, Pavel and Arnett, Catherine and Korotkova, Elizaveta and Yamshchikov, Ivan P. BPE Gets Picky: Efficient Vocabulary Refinement During Tokenizer Training. Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing. 2024. doi:10.18653/v1/2024.emnlp-main.925

work page doi:10.18653/v1/2024.emnlp-main.925 2024
[48]

and Kruskal, Joseph B

Hoffman, Alan J. and Kruskal, Joseph B. Integral Boundary Points of Convex Polyhedra. 50 Years of Integer Programming 1958-2008: From the Early Years to the State-of-the-Art. 2010. doi:10.1007/978-3-540-68279-0_3

work page doi:10.1007/978-3-540-68279-0_3 1958
[49]

Scaffold-

Lian, Haoran and Xiong, Yizhe and Niu, Jianwei and Mo, Shasha and Su, Zhenpeng and Lin, Zijia and Chen, Hui and Han, Jungong and Ding, Guiguang , volume=. Scaffold-. Proceedings of the AAAI Conference on Artificial Intelligence , year=. doi:10.1609/aaai.v39i23.34633 , number=

work page doi:10.1609/aaai.v39i23.34633
[50]

Proceedings of the Fifth Workshop on Insights from Negative Results in NLP , doi =

Cognetta, Marco and Hiraoka, Tatsuya and Sennrich, Rico and Pinter, Yuval and Okazaki, Naoaki , title =. Proceedings of the Fifth Workshop on Insights from Negative Results in NLP , doi =. 2024 , month = jun, pages =

work page 2024
[51]

Paths and cycles in colored graphs

Hajo Broersma and Xueliang Li and Gerhard Woeginger and Shenggui Zhang. Paths and cycles in colored graphs. Australasian journal of combinatorics. 2005

work page 2005
[52]

Journal of Combinatorial Optimization , year=

Approximation algorithms and hardness results for labeled connectivity problems , author=. Journal of Combinatorial Optimization , year=

work page
[53]

Approximation and hardness results for label cut and related problems , volume =

Zhang, Peng and Cai, Jin-Yi and Tang, Lin-Qing and Zhao, Wen-Bo , year =. Approximation and hardness results for label cut and related problems , volume =. Journal of Combinatorial Optimization , doi =

work page
[54]

Nemotron-

Shizhe Diao and Yu Yang and Yonggan Fu and Xin Dong and Dan SU and Markus Kliegl and ZIJIA CHEN and Peter Belcak and Yoshi Suhara and Hongxu Yin and Mostofa Patwary and Yingyan Celine Lin and Jan Kautz and Pavlo Molchanov , booktitle=. Nemotron-. 2026 , url=

work page 2026
[55]

Forsythe, Alasdair , year =

work page

[1] [2]

and Reddy, Varshini and Zhang, Haoran and Alameddine, Alec and Uzan, Omri and Pinter, Yuval and Tanner, Chris

Schmidt, Craig W. and Reddy, Varshini and Zhang, Haoran and Alameddine, Alec and Uzan, Omri and Pinter, Yuval and Tanner, Chris. Tokenization Is More Than Compression. Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing. 2024. doi:10.18653/v1/2024.emnlp-main.40

work page doi:10.18653/v1/2024.emnlp-main.40 2024

[2] [3]

and Shmoys, David B

Williamson, David P. and Shmoys, David B. , title =. 2011 , isbn =

work page 2011

[3] [4]

L. R. Ford and D. R. Fulkerson , publisher =. Flows in Networks , urldate =

work page

[4] [5]

Dantzig , journal =

George B. Dantzig , journal =. On the Shortest Route Through a Network , urldate =

work page

[5] [6]

NLLB Team and Costa-juss \`a , Marta R. and Cross, James and C elebi, Onur and Elbayad, Maha and Heafield, Kenneth and Heffernan, Kevin and Kalbassi, Elahe and Lam, Janice and Licht, Daniel and Maillard, Jean and Sun, Anna and Wang, Skyler and Wenzek, Guillaume and Youngblood, Al and Akula, Bapi and Barrault, Loic and Gonzalez, Gabriel Mejia and Hansanti,...

work page doi:10.1038/s41586-024-07335-x 2024

[6] [7]

A Partition Cover Approach to Tokenization. The Thirty-ninth Annual Conference on Neural Information Processing Systems.

[7] [8]

Tokenisation over Bounded Alphabets is Hard. The Fourteenth International Conference on Learning Representations.

[8] [9]

Theoretical Analysis of Byte-Pair Encoding. 2024.

[9] [10]

Mohsen Ghaffari and David R. Karger and Debmalya Panigrahi. Proceedings of the 2017 Annual ACM-SIAM Symposium on Discrete Algorithms (SODA). 2017. doi:10.1137/1.9781611974782.71

[10] [11]

Olmo 3. 2026.

[11] [12]

Meister, Clara.

[12] [13]

Whittington, Philip and Bachmann, Gregor and Pimentel, Tiago. Tokenisation is NP-complete. Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2025. doi:10.18653/v1/2025.acl-long.1365

[13] [14]

Andrej Karpathy. 2025.

[14] [15]

Jeffrey Li and Alex Fang and Georgios Smyrnis and Maor Ivgi and Matt Jordan and Samir Yitzhak Gadre and Hritik Bansal and Etash Kumar Guha and Sedrick Keh and Kushal Arora and Saurabh Garg and Rui Xin and Niklas Muennighoff and Reinhard Heckel and Jean Mercat and Mayee F Chen and Suchin Gururangan and Mitchell Wortsman and Alon Albalak and Yonatan Bitton ... 2024.

[15] [16]

Klotz, Ed and Newman, Alexandra M. Surveys in Operations Research and Management Science. 2013.

[16] [17]

Gage, Philip. C Users Journal. 1994.

[17] [18]

Sennrich, Rico and Haddow, Barry and Birch, Alexandra. Neural Machine Translation of Rare Words with Subword Units. Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2016. doi:10.18653/v1/P16-1162

[18] [19]

Sanghyun Choo and Wonjoon Kim. Applied Artificial Intelligence. 2023. doi:10.1080/08839514.2023.2175112

[19] [20]

Zouhar, Vilém and Meister, Clara and Gastaldi, Juan and Du, Li and Vieira, Tim and Sachan, Mrinmaya and Cotterell, Ryan. A Formal Perspective on Byte-Pair Encoding. Findings of the Association for Computational Linguistics: ACL 2023. 2023. doi:10.18653/v1/2023.findings-acl.38

[20] [21]

Zouhar, Vilém and Meister, Clara and Gastaldi, Juan and Du, Li and Sachan, Mrinmaya and Cotterell, Ryan. Tokenization and the Noiseless Channel. Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2023. doi:10.18653/v1/2023.acl-long.284

[21] [22]

Gallé, Matthias. Investigating the Effectiveness of BPE: The Power of Shorter Sequences. Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). 2019. doi:10.18653/v1/D19-1141

[22] [23]

Gowda, Thamme and May, Jonathan. Finding the Optimal Vocabulary Size for Neural Machine Translation. Findings of the Association for Computational Linguistics: EMNLP 2020. 2020. doi:10.18653/v1/2020.findings-emnlp.352

[23] [24]

Cognetta, Marco and Zouhar, Vilém and Moon, Sangwhan and Okazaki, Naoaki. Two Counterexamples to Tokenization and the Noiseless Channel. Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024). 2024

[24] [25]

Kudo, Taku. Subword Regularization: Improving Neural Network Translation Models with Multiple Subword Candidates. Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2018. doi:10.18653/v1/P18-1007

[25] [26]

Language Modeling Is Compression. The Twelfth International Conference on Learning Representations.

[26] [27]

Ali, Mehdi and Fromm, Michael and Thellmann, Klaudia and Rutmann, Richard and L. Tokenizer Choice For LLM Training: Negligible or Crucial?. Findings of the Association for Computational Linguistics: NAACL 2024. 2024. doi:10.18653/v1/2024.findings-naacl.247

[27] [28]

Goldman, Omer and Caciularu, Avi and Eyal, Matan and Cao, Kris and Szpektor, Idan and Tsarfaty, Reut. Unpacking Tokenization: Evaluating Text Compression and its Correlation with Model Performance. Findings of the Association for Computational Linguistics: ACL 2024. 2024. doi:10.18653/v1/2024.findings-acl.134

[28] [29]

OpenAI. GPT-4 Technical Report. arXiv:2303.08774

[29] [30]

Hugo Touvron and Thibaut Lavril and Gautier Izacard and Xavier Martinet and Marie-Anne Lachaux and Timothée Lacroix and Baptiste Rozière and Naman Goyal and Eric Hambro and Faisal Azhar and Aurelien Rodriguez and Armand Joulin and Edouard Grave and Guillaume Lample.

[30] [31]

Biderman, Stella and Schoelkopf, Hailey and Anthony, Quentin and Bradley, Herbie and O'Brien, Kyle and Hallahan, Eric and Khan, Mohammad Aflah and Purohit, Shivanshu and Prashanth, USVSN Sai and Raff, Edward and Skowron, Aviya and Sutawika, Lintang and Van Der Wal, Oskar. Pythia:

[31] [32]

Boundless Byte Pair Encoding: Breaking the Pre-tokenization Barrier. Second Conference on Language Modeling.

[32] [33]

Alisa Liu and Jonathan Hayase and Valentin Hofmann and Sewoong Oh and Noah A. Smith and Yejin Choi. Super. 2025.

[33] [34]

Dijkstra, E. W. Numerische Mathematik.

[34] [35]

Karp, Richard M. Reducibility among Combinatorial Problems. Complexity of Computer Computations: Proceedings of a symposium on the Complexity of Computer Computations, held March 20–22, 1972, at the IBM Thomas J. Watson Research Center, Yorktown Heights, New York, and sponsored by the Office of Naval Research, Mathematics Program, IBM World Trade Corpora... 1972. doi:10.1007/978-1-4684-2001-2_9

[35] [36]

Motwani, Rajeev and Raghavan, Prabhakar. Randomized Algorithms.

[36] [37]

Shmoys, David B. and Tardos. An approximation algorithm for the generalized assignment problem. Mathematical Programming. 1993. doi:10.1007/BF01585178

[37] [38]

Vazirani, Vijay V. 2010.

[38] [39]

Papadimitriou, Christos H. and Steiglitz, Kenneth. 1982.

[39] [40]

Expressing combinatorial optimization problems by linear programs. Proceedings of the twentieth annual ACM symposium on Theory of computing.

[40] [41]

Extension Complexity of Independent Set Polytopes. 2018. doi:10.1137/16M109884X

[41] [42]

Applegate, David et al. Practical large-scale linear programming using primal-dual hybrid gradient. Proceedings of the 35th International Conference on Neural Information Processing Systems.

[42] [43]

NVIDIA cuOpt User Guide: LP/QP/MILP Settings. 2025.

[43] [44]

Kaplan, Jared and McCandlish, Sam and Henighan, Tom and Brown, Tom B. and Chess, Benjamin and Child, Rewon and Gray, Scott and Radford, Alec and Wu, Jeffrey and Amodei, Dario. Scaling Laws for Neural Language Models. 2020. arXiv:2001.08361

[44] [45]

Hoffmann, Jordan and Borgeaud, Sebastian and Mensch, Arthur and Buchatskaya, Elena and Cai, Trevor and Rutherford, Eliza and de Las Casas, Diego and Hendricks, Lisa Anne and Welbl, Johannes and Clark, Aidan and Hennigan, Tom and Noland, Eric and Millican, Katie and van den Driessche, George and Damoc, Bogdan and Guy, Aurelia and Osindero, Simon and Simony... 2022.

[45] [46]

David L. Applegate and Robert E. Bixby and Vašek Chvátal and William J. Cook. The Traveling Salesman Problem: A Computational Study.

[46] [47]

Chizhov, Pavel and Arnett, Catherine and Korotkova, Elizaveta and Yamshchikov, Ivan P. BPE Gets Picky: Efficient Vocabulary Refinement During Tokenizer Training. Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing. 2024. doi:10.18653/v1/2024.emnlp-main.925

[47] [48]

Hoffman, Alan J. and Kruskal, Joseph B. Integral Boundary Points of Convex Polyhedra. 50 Years of Integer Programming 1958-2008: From the Early Years to the State-of-the-Art. 2010. doi:10.1007/978-3-540-68279-0_3

[48] [49]

Lian, Haoran and Xiong, Yizhe and Niu, Jianwei and Mo, Shasha and Su, Zhenpeng and Lin, Zijia and Chen, Hui and Han, Jungong and Ding, Guiguang. Scaffold-. Proceedings of the AAAI Conference on Artificial Intelligence. doi:10.1609/aaai.v39i23.34633

[49] [50]

Cognetta, Marco and Hiraoka, Tatsuya and Sennrich, Rico and Pinter, Yuval and Okazaki, Naoaki. Proceedings of the Fifth Workshop on Insights from Negative Results in NLP. 2024.

[50] [51]

Hajo Broersma and Xueliang Li and Gerhard Woeginger and Shenggui Zhang. Paths and cycles in colored graphs. Australasian Journal of Combinatorics. 2005

[51] [52]

Approximation algorithms and hardness results for labeled connectivity problems. Journal of Combinatorial Optimization.

[52] [53]

Zhang, Peng and Cai, Jin-Yi and Tang, Lin-Qing and Zhao, Wen-Bo. Approximation and hardness results for label cut and related problems. Journal of Combinatorial Optimization.

[53] [54]

Shizhe Diao and Yu Yang and Yonggan Fu and Xin Dong and Dan Su and Markus Kliegl and Zijia Chen and Peter Belcak and Yoshi Suhara and Hongxu Yin and Mostofa Patwary and Yingyan Celine Lin and Jan Kautz and Pavlo Molchanov. Nemotron-. 2026.

[54] [55]

Forsythe, Alasdair.