pith. sign in

arxiv: 2605.22821 · v1 · pith:A3HSD7RNnew · submitted 2026-05-21 · 💻 cs.CL · cs.LG

Tokenisation via Convex Relaxations

Pith reviewed 2026-05-22 05:16 UTC · model grok-4.3

classification 💻 cs.CL cs.LG
keywords tokenizationconvex optimizationlinear programminglanguage modelingBPEvocabulary selectionbits-per-byte
0
0 comments X

The pith

Tokenization can be cast as a linear program whose solution yields vocabularies within 1% of optimal.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper replaces greedy algorithms such as BPE and Unigram with a formulation of vocabulary selection as a linear program that is solved by convex optimization. The resulting method, ConvexTok, produces tokenizers that improve intrinsic metrics and the bits-per-byte achieved by language models. It also supplies a lower bound that certifies how close any given tokenizer is to the optimum under the chosen objective. A reader would care because the approach turns an opaque design choice into a verifiable optimization problem that can be tightened or relaxed as needed.

Core claim

By relaxing the discrete problem of choosing a vocabulary of fixed size into a linear program, ConvexTok finds tokenizers that improve bits-per-byte on language models and lie within 1% of the lower bound on the objective at typical vocabulary sizes; the same construction also improves most intrinsic tokenization metrics while producing mixed results on downstream tasks.

What carries the argument

ConvexTok, the solution of a linear program that relaxes vocabulary selection into a convex objective over token frequencies and coverage constraints.

If this is right

  • Language models achieve lower bits-per-byte with ConvexTok vocabularies than with BPE or Unigram at equal vocabulary size.
  • Intrinsic tokenization metrics such as fertility and coverage improve consistently.
  • A computable lower bound certifies that the obtained tokenizer is within 1% of optimal under the linear objective.
  • Downstream task accuracy gains appear but are smaller and less consistent than the efficiency gains.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same linear-programming framing could be applied to other preprocessing stages such as sentence segmentation or morphological segmentation.
  • If the objective can be made differentiable, the relaxation might be inserted directly into end-to-end model training.
  • Multilingual settings could benefit from joint optimization over multiple languages inside one linear program.

Load-bearing premise

The linear objective and its relaxation are assumed to be a faithful proxy for what actually improves language-model performance.

What would settle it

Train identical language models on corpora tokenized by ConvexTok versus BPE at the same vocabulary size and measure whether the bits-per-byte gap disappears or reverses.

Figures

Figures reproduced from arXiv: 2605.22821 by Craig W. Schmidt, Dennis Komm, Jan Tempus, Philip Whittington, Tiago Pimentel.

Figure 1
Figure 1. Figure 1: Tokenisation graph con￾structed from D = {abaa, aba}. Black edges represent Ebyte and others represent Etok. Our paper’s main contribution is an LP-based tokenisation algo￾rithm which directly approximates a globally-optimal tokeniser; thus, avoiding locally-optimal solutions. As our tokeniser relies on a convex relation, we term it ConvexTok. Before we get to our LP, however, we first express the problem … view at source ↗
Figure 2
Figure 2. Figure 2: The Det (left), Bias (center), and Int (right) rounding schemes. of the token it represents (c/length(c), with a slight abuse of notation to let length(c) denote the length of the corresponding token), and then rounds these values (to 1 or 0) as before. This biases the selection towards shorter tokens when LP scores are comparable, and is motivated by the fact that shorter tokens are more likely to occur o… view at source ↗
Figure 3
Figure 3. Figure 3: Average Jaccard similarity between vocabularies when retraining a tokeniser on indepen [PITH_FULL_IMAGE:figures/full_fig_p008_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Compression by the different tokenisers. (Left) absolute and (Right) relative to [PITH_FULL_IMAGE:figures/full_fig_p016_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Vocabulary utilisation by different tokenisers. (Left) absolute and (Right) relative to [PITH_FULL_IMAGE:figures/full_fig_p016_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Type-token ratio by different tokenisers. (Left) absolute and (Right) relative to [PITH_FULL_IMAGE:figures/full_fig_p016_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: Token length of different tokenisers. (Left) absolute and (Right) relative to [PITH_FULL_IMAGE:figures/full_fig_p016_7.png] view at source ↗
Figure 8
Figure 8. Figure 8: Tokens per line of different tokenisers. (Left) absolute and (Right) relative to [PITH_FULL_IMAGE:figures/full_fig_p017_8.png] view at source ↗
Figure 9
Figure 9. Figure 9: Shannon entropy of different tokenisers. (Left) absolute and (Right) relative to [PITH_FULL_IMAGE:figures/full_fig_p017_9.png] view at source ↗
Figure 10
Figure 10. Figure 10: Rényi entropy of different tokenisers. (Left) absolute and (Right) relative to [PITH_FULL_IMAGE:figures/full_fig_p017_10.png] view at source ↗
Figure 11
Figure 11. Figure 11: Average rank of different tokenisers. (Left) absolute and (Right) relative to [PITH_FULL_IMAGE:figures/full_fig_p017_11.png] view at source ↗
Figure 12
Figure 12. Figure 12: Compression by the different tokenisers. (Left) absolute and (Right) relative to [PITH_FULL_IMAGE:figures/full_fig_p018_12.png] view at source ↗
Figure 13
Figure 13. Figure 13: Vocabulary utilisation by different tokenisers. (Left) absolute and (Right) relative to [PITH_FULL_IMAGE:figures/full_fig_p018_13.png] view at source ↗
Figure 14
Figure 14. Figure 14: Type-token ratio by different tokenisers. (Left) absolute and (Right) relative to [PITH_FULL_IMAGE:figures/full_fig_p018_14.png] view at source ↗
Figure 15
Figure 15. Figure 15: Token length of different tokenisers. (Left) absolute and (Right) relative to [PITH_FULL_IMAGE:figures/full_fig_p018_15.png] view at source ↗
Figure 16
Figure 16. Figure 16: Tokens per line of different tokenisers. (Left) absolute and (Right) relative to [PITH_FULL_IMAGE:figures/full_fig_p019_16.png] view at source ↗
Figure 17
Figure 17. Figure 17: Shannon entropy of different tokenisers. (Left) absolute and (Right) relative to [PITH_FULL_IMAGE:figures/full_fig_p019_17.png] view at source ↗
Figure 18
Figure 18. Figure 18: Rényi entropy of different tokenisers. (Left) absolute and (Right) relative to [PITH_FULL_IMAGE:figures/full_fig_p019_18.png] view at source ↗
Figure 19
Figure 19. Figure 19: Average rank of different tokenisers. (Left) absolute and (Right) relative to [PITH_FULL_IMAGE:figures/full_fig_p019_19.png] view at source ↗
Figure 20
Figure 20. Figure 20: Compression by the different tokenisers. (Left) absolute and (Right) relative to [PITH_FULL_IMAGE:figures/full_fig_p020_20.png] view at source ↗
Figure 21
Figure 21. Figure 21: Vocabulary utilisation by different tokenisers. (Left) absolute and (Right) relative to [PITH_FULL_IMAGE:figures/full_fig_p020_21.png] view at source ↗
Figure 22
Figure 22. Figure 22: Type-token ratio by different tokenisers. (Left) absolute and (Right) relative to [PITH_FULL_IMAGE:figures/full_fig_p020_22.png] view at source ↗
Figure 23
Figure 23. Figure 23: Token length of different tokenisers. (Left) absolute and (Right) relative to [PITH_FULL_IMAGE:figures/full_fig_p020_23.png] view at source ↗
Figure 24
Figure 24. Figure 24: Tokens per line of different tokenisers. (Left) absolute and (Right) relative to [PITH_FULL_IMAGE:figures/full_fig_p021_24.png] view at source ↗
Figure 25
Figure 25. Figure 25: Shannon entropy of different tokenisers. (Left) absolute and (Right) relative to [PITH_FULL_IMAGE:figures/full_fig_p021_25.png] view at source ↗
Figure 26
Figure 26. Figure 26: Rényi entropy of different tokenisers. (Left) absolute and (Right) relative to [PITH_FULL_IMAGE:figures/full_fig_p021_26.png] view at source ↗
Figure 27
Figure 27. Figure 27: Average rank of different tokenisers. (Left) absolute and (Right) relative to [PITH_FULL_IMAGE:figures/full_fig_p021_27.png] view at source ↗
Figure 33
Figure 33. Figure 33: (left) BpB and (right) CORE vs. vocabulary size across models with different depths [PITH_FULL_IMAGE:figures/full_fig_p023_33.png] view at source ↗
Figure 34
Figure 34. Figure 34: (left) BpB and (right) CORE vs. vocabulary size across three training seeds. All these models were trained with 12 layers. 23 [PITH_FULL_IMAGE:figures/full_fig_p023_34.png] view at source ↗
read the original abstract

Tokenisation is an integral part of the current NLP pipeline. Current tokenisation algorithms such as BPE and Unigram are greedy algorithms -- they make locally optimal decisions without considering the resulting vocabulary as a whole. We instead formulate tokeniser construction as a linear program and solve it using convex optimisation tools, yielding a new algorithm we call ConvexTok. We find ConvexTok consistently improves intrinsic tokenisation metrics and the bits-per-byte (BpB) achieved by language models; it also improves downstream task performance, but less consistently. Furthermore, ConvexTok allows the user to certify how far their tokeniser is from optimal, with respect to a certain objective, via a lower bound, and we empirically find it to be within 1\% of optimal at common vocabulary sizes.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper presents ConvexTok, a new tokenization method that formulates the construction of the vocabulary as a linear program solved using convex optimization techniques. Unlike greedy algorithms such as BPE and Unigram, it considers the vocabulary as a whole. The authors report that ConvexTok consistently improves intrinsic tokenization metrics and the bits-per-byte (BpB) of language models, with less consistent improvements on downstream tasks. Additionally, it provides a lower bound to certify the distance to optimality with respect to the LP objective, empirically showing solutions within 1% of optimal for common vocabulary sizes.

Significance. If the results hold and the LP objective is a good proxy for LM performance, this work introduces a principled, optimization-based approach to tokenization with the novel feature of optimality certificates. This could be significant for the field as it moves beyond heuristic methods. The use of convex relaxations and lower bounds is a strength that allows for verifiable claims.

major comments (2)
  1. Abstract: The claim that ConvexTok 'consistently improves' BpB and downstream performance, and is 'within 1% of optimal', requires more detail on the specific LP formulation, the definition of the objective function, data exclusion rules, and statistical significance tests to rule out post-hoc fitting or selection effects.
  2. Method: The linear programming relaxation and chosen objective function need to be validated as a faithful model of tokenizer quality for language modeling. No ablation is mentioned that compares the LP objective value to the actual negative log-likelihood of a trained transformer or holds data and model fixed while varying the tokenizer objective.
minor comments (1)
  1. Abstract: The abstract mentions 'a certain objective' for the optimality certificate; this should be explicitly defined early in the paper.

Simulated Author's Rebuttal

2 responses · 0 unresolved

Thank you for the referee's insightful comments on our work. We respond to each major comment below and indicate the revisions we plan to make.

read point-by-point responses
  1. Referee: Abstract: The claim that ConvexTok 'consistently improves' BpB and downstream performance, and is 'within 1% of optimal', requires more detail on the specific LP formulation, the definition of the objective function, data exclusion rules, and statistical significance tests to rule out post-hoc fitting or selection effects.

    Authors: We concur that providing more detail in the abstract would strengthen the presentation of our results. In the revised manuscript, we will elaborate on the LP formulation and the definition of the objective function. We will also specify the data exclusion rules applied in our experiments and include statistical significance tests for the reported improvements in BpB and downstream performance to address potential concerns about selection effects. The claim of being within 1% of optimal is supported by the lower bound computation described in the paper, and we will ensure this is clearly contextualized. revision: yes

  2. Referee: Method: The linear programming relaxation and chosen objective function need to be validated as a faithful model of tokenizer quality for language modeling. No ablation is mentioned that compares the LP objective value to the actual negative log-likelihood of a trained transformer or holds data and model fixed while varying the tokenizer objective.

    Authors: This is a fair point regarding the validation of our objective. Our LP formulation optimizes a specific objective that we argue serves as a reasonable proxy for tokenizer quality, as demonstrated by the improvements in intrinsic metrics and BpB. However, we acknowledge that a direct ablation comparing the LP objective to the NLL of a fixed transformer model was not included. We will add a section in the revised paper discussing the motivation for the chosen objective and its empirical correlation with LM performance. A full ablation study may be included if it can be completed within the revision timeline, but we note that such experiments are computationally intensive. We disagree that this invalidates the current contributions, as the optimality certificate is a novel aspect independent of this validation. revision: partial

Circularity Check

0 steps flagged

No circularity: LP formulation and empirical gains are independent

full rationale

The derivation introduces a linear-programming relaxation of vocabulary selection as a novel modeling choice solved by standard convex-optimization tools; the resulting tokenizers are then evaluated on separate, externally measured quantities (BpB of trained language models and downstream task accuracy). No equation equates a claimed improvement or optimality gap to a quantity defined by the same fitted parameters or by a self-citation chain; the 1% optimality certificate is explicitly relative to the paper's own LP objective, while BpB and task gains are reported from independent training runs. The method is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the domain assumption that tokenization quality can be usefully expressed as the objective of a linear program whose relaxation yields practical solutions. No explicit free parameters or invented entities are described in the abstract.

axioms (1)
  • domain assumption Tokenization construction admits a linear programming formulation whose relaxation is solvable by convex optimization tools
    This premise is required for the ConvexTok algorithm to exist and is invoked when the abstract contrasts the new method with greedy algorithms.

pith-pipeline@v0.9.0 · 5658 in / 1441 out tokens · 44050 ms · 2026-05-22T05:16:50.783338+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

54 extracted references · 54 canonical work pages · 2 internal anchors

  1. [2]

    and Reddy, Varshini and Zhang, Haoran and Alameddine, Alec and Uzan, Omri and Pinter, Yuval and Tanner, Chris

    Schmidt, Craig W. and Reddy, Varshini and Zhang, Haoran and Alameddine, Alec and Uzan, Omri and Pinter, Yuval and Tanner, Chris. Tokenization Is More Than Compression. Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing. 2024. doi:10.18653/v1/2024.emnlp-main.40

  2. [3]

    and Shmoys, David B

    Williamson, David P. and Shmoys, David B. , title =. 2011 , isbn =

  3. [4]

    L. R. Ford and D. R. Fulkerson , publisher =. Flows in Networks , urldate =

  4. [5]

    Dantzig , journal =

    George B. Dantzig , journal =. On the Shortest Route Through a Network , urldate =

  5. [6]

    NLLB Team and Costa-juss \`a , Marta R. and Cross, James and C elebi, Onur and Elbayad, Maha and Heafield, Kenneth and Heffernan, Kevin and Kalbassi, Elahe and Lam, Janice and Licht, Daniel and Maillard, Jean and Sun, Anna and Wang, Skyler and Wenzek, Guillaume and Youngblood, Al and Akula, Bapi and Barrault, Loic and Gonzalez, Gabriel Mejia and Hansanti,...

  6. [7]

    The Thirty-ninth Annual Conference on Neural Information Processing Systems , year=

    A Partition Cover Approach to Tokenization , author=. The Thirty-ninth Annual Conference on Neural Information Processing Systems , year=

  7. [8]

    The Fourteenth International Conference on Learning Representations , year=

    Tokenisation over Bounded Alphabets is Hard , author=. The Fourteenth International Conference on Learning Representations , year=

  8. [9]

    2024 , eprint=

    Theoretical Analysis of Byte-Pair Encoding , author=. 2024 , eprint=

  9. [10]

    Karger and Debmalya Panigrahi , title =

    Mohsen Ghaffari and David R. Karger and Debmalya Panigrahi , title =. Proceedings of the 2017 Annual ACM-SIAM Symposium on Discrete Algorithms (SODA) , chapter =. 2017 , publisher=. doi:10.1137/1.9781611974782.71 , URL =

  10. [11]

    2026 , eprint=

    Olmo 3 , author=. 2026 , eprint=

  11. [12]

    Meister, Clara , year =

  12. [13]

    Tokenisation is NP -complete

    Whittington, Philip and Bachmann, Gregor and Pimentel, Tiago. Tokenisation is NP -complete. Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2025. doi:10.18653/v1/2025.acl-long.1365

  13. [14]

    2025 , publisher =

    Andrej Karpathy , title =. 2025 , publisher =

  14. [15]

    Jeffrey Li and Alex Fang and Georgios Smyrnis and Maor Ivgi and Matt Jordan and Samir Yitzhak Gadre and Hritik Bansal and Etash Kumar Guha and Sedrick Keh and Kushal Arora and Saurabh Garg and Rui Xin and Niklas Muennighoff and Reinhard Heckel and Jean Mercat and Mayee F Chen and Suchin Gururangan and Mitchell Wortsman and Alon Albalak and Yonatan Bitton ...

  15. [16]

    , title =

    Klotz, Ed and Newman, Alexandra M. , title =. Surveys in Operations Research and Management Science , volume =. 2013 , doi =

  16. [17]

    C Users Journal , month =

    Gage, Philip , title =. C Users Journal , month =. 1994 , issue_date =

  17. [18]

    Neural Machine Translation of Rare Words with Subword Units

    Sennrich, Rico and Haddow, Barry and Birch, Alexandra. Neural Machine Translation of Rare Words with Subword Units. Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2016. doi:10.18653/v1/P16-1162

  18. [19]

    Applied Artificial Intelligence , volume =

    Sanghyun Choo and Wonjoon Kim , title =. Applied Artificial Intelligence , volume =. 2023 , publisher =. doi:10.1080/08839514.2023.2175112 , URL =

  19. [20]

    A Formal Perspective on Byte-Pair Encoding

    Zouhar, Vil \'e m and Meister, Clara and Gastaldi, Juan and Du, Li and Vieira, Tim and Sachan, Mrinmaya and Cotterell, Ryan. A Formal Perspective on Byte-Pair Encoding. Findings of the Association for Computational Linguistics: ACL 2023. 2023. doi:10.18653/v1/2023.findings-acl.38

  20. [21]

    Tokenization and the Noiseless Channel

    Zouhar, Vil \'e m and Meister, Clara and Gastaldi, Juan and Du, Li and Sachan, Mrinmaya and Cotterell, Ryan. Tokenization and the Noiseless Channel. Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2023. doi:10.18653/v1/2023.acl-long.284

  21. [22]

    Investigating the Effectiveness of BPE : The Power of Shorter Sequences

    Gall \'e , Matthias. Investigating the Effectiveness of BPE : The Power of Shorter Sequences. Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). 2019. doi:10.18653/v1/D19-1141

  22. [23]

    Finding the Optimal Vocabulary Size for Neural Machine Translation

    Gowda, Thamme and May, Jonathan. Finding the Optimal Vocabulary Size for Neural Machine Translation. Findings of the Association for Computational Linguistics: EMNLP 2020. 2020. doi:10.18653/v1/2020.findings-emnlp.352

  23. [24]

    Two Counterexamples to Tokenization and the Noiseless Channel

    Cognetta, Marco and Zouhar, Vil \'e m and Moon, Sangwhan and Okazaki, Naoaki. Two Counterexamples to Tokenization and the Noiseless Channel. Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024). 2024

  24. [25]

    Subword Regularization: Improving Neural Network Translation Models with Multiple Subword Candidates

    Kudo, Taku. Subword Regularization: Improving Neural Network Translation Models with Multiple Subword Candidates. Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2018. doi:10.18653/v1/P18-1007

  25. [26]

    The Twelfth International Conference on Learning Representations , year=

    Language Modeling Is Compression , author=. The Twelfth International Conference on Learning Representations , year=

  26. [27]

    Tokenizer Choice For LLM Training: Negligible or Crucial?

    Ali, Mehdi and Fromm, Michael and Thellmann, Klaudia and Rutmann, Richard and L. Tokenizer Choice For LLM Training: Negligible or Crucial?. Findings of the Association for Computational Linguistics: NAACL 2024. 2024. doi:10.18653/v1/2024.findings-naacl.247

  27. [28]

    Unpacking Tokenization: Evaluating Text Compression and its Correlation with Model Performance

    Goldman, Omer and Caciularu, Avi and Eyal, Matan and Cao, Kris and Szpektor, Idan and Tsarfaty, Reut. Unpacking Tokenization: Evaluating Text Compression and its Correlation with Model Performance. Findings of the Association for Computational Linguistics: ACL 2024. 2024. doi:10.18653/v1/2024.findings-acl.134

  28. [29]

    GPT-4 Technical Report

    OpenAI , year=. 2303.08774 , archivePrefix=

  29. [30]

    Hugo Touvron and Thibaut Lavril and Gautier Izacard and Xavier Martinet and Marie-Anne Lachaux and Timothée Lacroix and Baptiste Rozière and Naman Goyal and Eric Hambro and Faisal Azhar and Aurelien Rodriguez and Armand Joulin and Edouard Grave and Guillaume Lample , url=

  30. [31]

    Biderman, Stella and Schoelkopf, Hailey and Anthony, Quentin and Bradley, Herbie and O'Brien, Kyle and Hallahan, Eric and Khan, Mohammad Aflah and Purohit, Shivanshu and Prashanth, USVSN Sai and Raff, Edward and Skowron, Aviya and Sutawika, Lintang and Van Der Wal, Oskar , booktitle =. Pythia:

  31. [32]

    Second Conference on Language Modeling , year=

    Boundless Byte Pair Encoding: Breaking the Pre-tokenization Barrier , author=. Second Conference on Language Modeling , year=

  32. [33]

    Smith and Yejin Choi , booktitle=

    Alisa Liu and Jonathan Hayase and Valentin Hofmann and Sewoong Oh and Noah A. Smith and Yejin Choi , booktitle=. Super. 2025 , url=

  33. [34]

    Dijkstra, E. W. , title =. Numerische Mathematik , year =

  34. [35]

    Reducibility among combinatorial problems

    Karp, Richard M. Reducibility among Combinatorial Problems. Complexity of Computer Computations: Proceedings of a symposium on the Complexity of Computer Computations, held March 20--22, 1972, at the IBM Thomas J. Watson Research Center, Yorktown Heights, New York, and sponsored by the Office of Naval Research, Mathematics Program, IBM World Trade Corpora...

  35. [36]

    Randomized Algorithms , publisher=

    Motwani, Rajeev and Raghavan, Prabhakar , year=. Randomized Algorithms , publisher=

  36. [37]

    and Tardos,

    Shmoys, David B. and Tardos,. An approximation algorithm for the generalized assignment problem , url =. Mathematical Programming , number =. 1993 , bdsk-url-1 =. doi:10.1007/BF01585178 , id =

  37. [38]

    , title =

    Vazirani, Vijay V. , title =. 2010 , url=

  38. [39]

    and Steiglitz, Kenneth , title =

    Papadimitriou, Christos H. and Steiglitz, Kenneth , title =. 1982 , isbn =

  39. [40]

    Proceedings of the twentieth annual ACM symposium on Theory of computing , year=

    Expressing combinatorial optimization problems by linear programs , author=. Proceedings of the twentieth annual ACM symposium on Theory of computing , year=

  40. [41]

    Extension Complexity of Independent Set Polytopes , journal =

    G\". Extension Complexity of Independent Set Polytopes , journal =. 2018 , doi =. https://doi.org/10.1137/16M109884X , abstract =

  41. [42]

    Practical large-scale linear programming using primal-dual hybrid gradient , year =

    Applegate, David and D\'. Practical large-scale linear programming using primal-dual hybrid gradient , year =. Proceedings of the 35th International Conference on Neural Information Processing Systems , articleno =

  42. [43]

    2025 , url =

    NVIDIA cuOpt User Guide: LP/QP/MILP Settings , author =. 2025 , url =

  43. [44]

    Scaling Laws for Neural Language Models

    Kaplan, Jared and McCandlish, Sam and Henighan, Tom and Brown, Tom B. and Chess, Benjamin and Child, Rewon and Gray, Scott and Radford, Alec and Wu, Jeffrey and Amodei, Dario. Scaling Laws for Neural Language Models. 2020. arXiv:2001.08361

  44. [45]

    and Sifre, Laurent , title =

    Hoffmann, Jordan and Borgeaud, Sebastian and Mensch, Arthur and Buchatskaya, Elena and Cai, Trevor and Rutherford, Eliza and de Las Casas, Diego and Hendricks, Lisa Anne and Welbl, Johannes and Clark, Aidan and Hennigan, Tom and Noland, Eric and Millican, Katie and van den Driessche, George and Damoc, Bogdan and Guy, Aurelia and Osindero, Simon and Simony...

  45. [46]

    Applegate and Robert E

    David L. Applegate and Robert E. Bixby and Vašek Chvatál and William J. Cook , publisher =. The Traveling Salesman Problem: A Computational Study , urldate =

  46. [47]

    BPE Gets Picky: Efficient Vocabulary Refinement During Tokenizer Training

    Chizhov, Pavel and Arnett, Catherine and Korotkova, Elizaveta and Yamshchikov, Ivan P. BPE Gets Picky: Efficient Vocabulary Refinement During Tokenizer Training. Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing. 2024. doi:10.18653/v1/2024.emnlp-main.925

  47. [48]

    and Kruskal, Joseph B

    Hoffman, Alan J. and Kruskal, Joseph B. Integral Boundary Points of Convex Polyhedra. 50 Years of Integer Programming 1958-2008: From the Early Years to the State-of-the-Art. 2010. doi:10.1007/978-3-540-68279-0_3

  48. [49]

    Scaffold-

    Lian, Haoran and Xiong, Yizhe and Niu, Jianwei and Mo, Shasha and Su, Zhenpeng and Lin, Zijia and Chen, Hui and Han, Jungong and Ding, Guiguang , volume=. Scaffold-. Proceedings of the AAAI Conference on Artificial Intelligence , year=. doi:10.1609/aaai.v39i23.34633 , number=

  49. [50]

    Proceedings of the Fifth Workshop on Insights from Negative Results in NLP , doi =

    Cognetta, Marco and Hiraoka, Tatsuya and Sennrich, Rico and Pinter, Yuval and Okazaki, Naoaki , title =. Proceedings of the Fifth Workshop on Insights from Negative Results in NLP , doi =. 2024 , month = jun, pages =

  50. [51]

    Paths and cycles in colored graphs

    Hajo Broersma and Xueliang Li and Gerhard Woeginger and Shenggui Zhang. Paths and cycles in colored graphs. Australasian journal of combinatorics. 2005

  51. [52]

    Journal of Combinatorial Optimization , year=

    Approximation algorithms and hardness results for labeled connectivity problems , author=. Journal of Combinatorial Optimization , year=

  52. [53]

    Approximation and hardness results for label cut and related problems , volume =

    Zhang, Peng and Cai, Jin-Yi and Tang, Lin-Qing and Zhao, Wen-Bo , year =. Approximation and hardness results for label cut and related problems , volume =. Journal of Combinatorial Optimization , doi =

  53. [54]

    Nemotron-

    Shizhe Diao and Yu Yang and Yonggan Fu and Xin Dong and Dan SU and Markus Kliegl and ZIJIA CHEN and Peter Belcak and Yoshi Suhara and Hongxu Yin and Mostofa Patwary and Yingyan Celine Lin and Jan Kautz and Pavlo Molchanov , booktitle=. Nemotron-. 2026 , url=

  54. [55]

    Forsythe, Alasdair , year =