MinGram: A Minimalist Unigram Tokenizer with High Compression and Competitive Morphological Alignment

Sander Land

arxiv: 2606.27019 · v1 · pith:GJOJ7COSnew · submitted 2026-06-25 · 💻 cs.CL


Pith reviewed 2026-06-26 04:56 UTC · model grok-4.3

classification 💻 cs.CL

keywords: MinGram, Unigram tokenizer, BPE, compression, morphological alignment, language model training, tokenization, bits-per-byte


The pith

MinGram simplifies Unigram training with a BPE seed and Hard EM to compress better than BPE and standard Unigram while retaining higher morphological alignment.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper aims to establish that Unigram tokenizers can be trained far more simply than usual without losing their advantages. MinGram starts from a BPE-derived seed vocabulary, runs Hard EM along the minimum-token path, and applies one flat score-pruning step. This removes the suffix array, forward-backward algorithm, and repeated pruning loop, leaving a procedure that needs little beyond ordinary tokenizer inference. Token count is treated as the main goal and the Unigram score is used only to break ties. Across six languages the resulting tokenizers compress better than BPE and standard Unigram; a compression-focused variant matches the best pure count-based methods yet keeps substantially higher morphological alignment; and Unigram-family tokenizers, MinGram among them, consistently produce lower bits-per-byte than BPE in controlled language-model training.
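The "token count first, Unigram score as tiebreak" rule can be illustrated with a small dynamic program. This is a minimal sketch under stated assumptions, not the paper's implementation: the function name `segment_min_tokens`, the toy vocabulary, and its log-probabilities are all hypothetical. Each prefix is ranked by the pair (token count, negative total log-probability), so a higher score only matters among segmentations with equally few tokens.

```python
import math

def segment_min_tokens(text, log_probs):
    """Fewest-token split of text; ties broken by higher total log-probability."""
    n = len(text)
    max_len = max(len(tok) for tok in log_probs)
    # best[i] = (token_count, negative_score) for the best split of text[:i]
    best = [(math.inf, math.inf)] * (n + 1)
    back = [0] * (n + 1)
    best[0] = (0, 0.0)
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            piece = text[start:end]
            if piece not in log_probs or best[start][0] == math.inf:
                continue
            # Tuples compare lexicographically: count first, then score.
            cand = (best[start][0] + 1, best[start][1] - log_probs[piece])
            if cand < best[end]:
                best[end] = cand
                back[end] = start
    if best[n][0] == math.inf:
        raise ValueError("text cannot be segmented with this vocabulary")
    tokens, pos = [], n
    while pos > 0:
        tokens.append(text[back[pos]:pos])
        pos = back[pos]
    return tokens[::-1]

# Toy vocabulary with made-up scores (illustration only).
vocab = {"a": -1.0, "b": -1.0, "c": -1.0, "ab": -2.0, "bc": -1.5}
print(segment_min_tokens("abc", vocab))  # ['a', 'bc']
```

Both two-token splits of "abc" are allowed here; the score decides between them, and adding a single token "abc" to the vocabulary would win regardless of its score.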

Core claim

MinGram keeps the token-list representation of Unigram but replaces its heavy training machinery with a BPE-derived seed vocabulary, Hard EM on a minimum-token path, and a single flat score-pruning step. By making token count the primary objective and the Unigram score only a tiebreaker, MinGram produces tokenizers that compress better than both BPE and standard Unigram across six languages while retaining substantially higher morphological alignment than pure token-count compressors. In controlled downstream language-model training, Unigram-family tokenizers, with MinGram among the best, consistently beat BPE when measured by bits-per-byte.

What carries the argument

BPE-derived seed vocabulary plus Hard EM on the minimum-token path and a single flat score-pruning step that ranks token count first and Unigram score second.
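A rough sketch of how these pieces could fit together is given below. Everything in it beyond the three named ingredients is an assumption made for illustration: the uniform initial scores, the add-one smoothing, the fixed number of iterations, the choice to always keep single characters, and the function names. The segmenter is passed in as a callable, and the demo uses a greedy longest-match stand-in only to keep the block short; a minimum-token segmenter would take its place.

```python
import math
from collections import Counter

def longest_match(word, log_probs):
    """Stand-in segmenter (greedy longest match), NOT the minimum-token path."""
    out, i = [], 0
    while i < len(word):
        j = next(j for j in range(len(word), i, -1) if word[i:j] in log_probs)
        out.append(word[i:j])
        i = j
    return out

def hard_em_and_prune(corpus, seed_vocab, segment, target_size, iterations=2):
    """Re-estimate token scores from one segmentation per word, then prune once."""
    # Assumption: start from uniform scores over the seed vocabulary.
    log_probs = {tok: -math.log(len(seed_vocab)) for tok in seed_vocab}
    for _ in range(iterations):
        counts = Counter()
        for word in corpus:  # hard E-step: a single path per word
            counts.update(segment(word, log_probs))
        # M-step; add-one smoothing keeps unused tokens at a finite score.
        denom = sum(counts.values()) + len(log_probs)
        log_probs = {t: math.log((counts[t] + 1) / denom) for t in log_probs}
    # Single flat prune: keep single characters, then the best-scoring tokens.
    chars = {t for t in log_probs if len(t) == 1}
    multi = sorted((t for t in log_probs if len(t) > 1),
                   key=lambda t: log_probs[t], reverse=True)
    keep = chars | set(multi[:max(0, target_size - len(chars))])
    return {t: log_probs[t] for t in keep}

seed = {"a", "b", "c", "ab", "bc"}  # hypothetical seed vocabulary
pruned = hard_em_and_prune(["abab", "abc", "ab"], seed, longest_match, 4)
print(sorted(pruned))  # ['a', 'ab', 'b', 'c']
```

The point of the sketch is structural: training needs only a segmenter, a counter, and one sort, with no suffix array or forward-backward pass.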

If this is right

Simpler training procedures can produce tokenizers that are at least as effective as the more complex originals.
Unigram-family tokenizers can be chosen over BPE when downstream language-model bits-per-byte is the performance metric.
A compression-oriented variant can approach the best token-count compressors without sacrificing as much morphological alignment.
Tokenizer development can focus on minimum-token paths rather than full probabilistic inference.
The same simplification pattern may be reusable for other tokenizer families that currently rely on heavy training loops.
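Bits-per-byte is what makes the downstream comparison fair across tokenizers, because it normalizes the language-model loss by the byte length of the text rather than by the token count. A minimal sketch of the standard conversion follows; the function name and the assumption that per-token losses arrive as negative log-likelihoods in nats are illustrative choices, not details taken from the paper.

```python
import math

def bits_per_byte(token_nll_nats, text):
    """Convert summed per-token loss (in nats) to bits per UTF-8 byte."""
    total_bytes = len(text.encode("utf-8"))
    return sum(token_nll_nats) / (math.log(2) * total_bytes)

# Two tokens, each costing 3 bits, over a 6-byte string -> 1.0 bit per byte.
print(bits_per_byte([3 * math.log(2), 3 * math.log(2)], "héllo"))
```

Because the denominator depends only on the text, a tokenizer that emits fewer tokens gains nothing unless the model's total loss over those tokens is also lower.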

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The method could let practitioners iterate tokenizer vocabularies more quickly when adapting to new domains or data distributions.
Removing the iterative prune loop may reduce the compute barrier to experimenting with Unigram-style tokenizers on modest hardware.
The emphasis on minimum token count may translate into smaller effective model sizes or faster inference in resource-limited settings.
Similar seed-and-prune shortcuts might be tested on other tokenization objectives such as fertility or downstream task performance.

Load-bearing premise

That a BPE-derived seed vocabulary, Hard EM restricted to the minimum-token path, and one flat pruning step will yield tokenizers whose compression and morphological properties match or exceed those obtained from full Unigram training with its removed components.

What would settle it

Run MinGram and standard Unigram on a seventh language; if MinGram no longer compresses better than BPE or loses the morphological-alignment advantage, the central claim is falsified.
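The compression half of such a check can be scripted in a few lines. The sketch below assumes bytes per token as the compression measure, which this page does not confirm as the paper's exact metric, and it uses `str.split` and `list` as placeholder tokenizers where real BPE, Unigram, and MinGram tokenizers would be plugged in.

```python
def bytes_per_token(texts, tokenize):
    """Average UTF-8 bytes covered by one token; higher means better compression."""
    total_bytes = sum(len(t.encode("utf-8")) for t in texts)
    total_tokens = sum(len(tokenize(t)) for t in texts)
    return total_bytes / total_tokens

held_out = ["ab cd", "ef"]  # placeholder held-out text
for name, tok in [("whitespace", str.split), ("characters", list)]:
    print(name, bytes_per_token(held_out, tok))
```

Running the same function over each tokenizer on identical held-out text from the new language gives directly comparable numbers.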

Figures

Figures reproduced from arXiv: 2606.27019 by Sander Land.

**Figure 2.** Compression sensitivity to overshoot factor.

Original abstract

The Unigram tokenizer uses an elegant representation which makes it straightforward to edit vocabularies, but its training is comparatively heavy and complex. We introduce MinGram (Minimalist Unigram), which keeps the token-list representation but simplifies training using a BPE-derived seed vocabulary, Hard EM on a minimum-token path, and a single flat score-pruning step. This removes the suffix array, the forward-backward pass, and the iterative prune loop, leaving a procedure that requires little beyond tokenizer inference itself. By making token count the primary objective and using a Unigram score only as a tiebreak, MinGram keeps the compression of pure token-count methods while retaining much of the morphological alignment and downstream quality of probabilistic ones. Across six languages, MinGram compresses better than both BPE and standard Unigram, and a compression-oriented variant matches the strongest token-count compressors while retaining substantially higher morphological alignment. In controlled downstream language-model training, Unigram-family tokenizers, with MinGram among the best, consistently beat BPE in bits-per-byte.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note

private letter to a colleague

MinGram strips Unigram training down to a BPE seed, Hard EM on minimum-token paths, and a single prune, claiming better compression and morphological alignment, but the equivalence to full Unigram rests on an untested assumption.


The main thing to know is that this paper gives a lighter training recipe for Unigram-style tokenizers: start from a BPE seed vocabulary, run hard EM restricted to minimum-token paths, then do one flat score-based prune. That drops the suffix array, forward-backward, and the usual iterative prune loop, leaving something that mostly needs inference code.

What is actually new is the specific combination of those three steps as a drop-in replacement. The paper does well by making token count the main objective and using the Unigram score only as tiebreaker; that choice is explicit and aims to keep compression strong while holding onto morphological alignment. The downstream language-model experiments are also a plus, since they run controlled comparisons and report bits-per-byte rather than just tokenizer metrics.

The soft spot is the core assumption that the stripped procedure still produces tokenizers whose compression and alignment scores match or exceed those of both BPE and standard Unigram. The abstract states the simplification but gives no ablations on the removed components, no direct comparison of the resulting token probabilities or segmentation distributions, and no statistical tests. If those checks are missing from the full paper too, the equivalence claim stays unverified. This is where the claim is most exposed: without evidence that the suffix array and forward-backward pass were dispensable, the reported gains could come from the BPE seed or the Hard EM restriction rather than from the minimalist design itself.

This is for people who build or tune tokenizers and want lower training cost without losing too much quality. A reader who already works on Unigram variants or efficiency trade-offs would get the most out of the empirical numbers, provided the methods section supplies the missing ablations.

I would send it to peer review so the experiments can be checked in detail.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces MinGram, a minimalist Unigram tokenizer that starts from a BPE-derived seed vocabulary and applies Hard EM restricted to minimum-token paths followed by a single flat score-pruning step. This removes the suffix array, forward-backward algorithm, and iterative prune loop from standard Unigram training. The central empirical claims are that MinGram achieves higher compression than both BPE and standard Unigram across six languages, a compression-oriented variant matches the strongest token-count compressors while retaining substantially higher morphological alignment, and Unigram-family tokenizers (with MinGram among the best) consistently outperform BPE in controlled downstream language-model training as measured by bits-per-byte.

Significance. If the results hold after verification of the simplifications, the contribution would be significant for NLP tokenization research. It demonstrates that a substantially lighter training procedure can match or exceed the compression and downstream performance of both BPE and full Unigram while preserving morphological alignment better than pure count-based methods. The explicit prioritization of token count with Unigram score used only as tiebreak is a clean way to combine objectives. The work would be strengthened by reproducible code or parameter-free derivations, but none are mentioned.

major comments (2)

[Training procedure (§3) and Experiments (§4)] The central claim that the BPE-seeded Hard EM on minimum-token paths plus single pruning produces tokenizers whose compression and morphological properties meet or exceed those of full Unigram rests on the unverified assumption that the removed components (suffix array, forward-backward, iterative pruning) are dispensable. No ablation or direct comparison to standard Unigram training is described that would confirm the resulting token probabilities and segmentations remain sufficiently close.
[Abstract and §4.3] The abstract and experimental claims state improvements 'across six languages' and 'consistently beat BPE in bits-per-byte' but supply no baselines, statistical tests, number of runs, variance estimates, or exact metric values. This makes it impossible to assess whether the reported gains are robust or load-bearing for the downstream conclusion.
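A paired comparison across seeds, of the kind this comment asks for, could be computed along the following lines. It is a sketch only: the function name is hypothetical, the example scores are made-up numbers rather than results from the paper, and the p-value lookup is left to a statistics package.

```python
import math
import statistics

def paired_t_statistic(scores_a, scores_b):
    """Return (t, degrees_of_freedom) for paired per-seed scores."""
    if len(scores_a) != len(scores_b) or len(scores_a) < 2:
        raise ValueError("need two equally long lists with at least two seeds")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    # Standard error of the mean difference; zero if all differences are equal.
    std_err = statistics.stdev(diffs) / math.sqrt(len(diffs))
    return statistics.fmean(diffs) / std_err, len(diffs) - 1

# Made-up per-seed scores for two systems (illustration only).
t, df = paired_t_statistic([1.0, 2.0, 3.0], [0.5, 1.0, 2.5])
print(t, df)
```

Pairing by seed removes seed-to-seed variance shared by both systems, which is why it suits controlled comparisons where only the tokenizer changes.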

minor comments (2)

[§3] Notation for the 'flat score-pruning step' and 'minimum-token path' should be defined with explicit equations or pseudocode to allow replication.
[§4.1] The six languages used in the multilingual experiments are not listed; adding this detail would improve clarity.
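As an illustration of the kind of definition the first minor comment requests, one possible formalization is sketched below. It is an assumption based only on the abstract's wording, not the paper's own notation: S(x) denotes all segmentations of a string x into tokens of a vocabulary V, p(t) the Unigram probability of token t, and K the target vocabulary size.

```latex
% Minimum-token path with the Unigram score as tiebreak (assumed notation)
\[
S^{*}(x) = \operatorname*{arg\,min}_{s \in S(x)} |s|,
\qquad
\hat{s}(x) = \operatorname*{arg\,max}_{s \in S^{*}(x)} \sum_{t \in s} \log p(t)
\]
% Single flat prune: keep the K highest-scoring tokens in one step
\[
V' = \{\, t \in V : \operatorname{rank}_{V}\bigl(\log p(t)\bigr) \le K \,\}
\]
```

Here rank 1 is taken to mean the highest-scoring token; whether required tokens such as single characters are exempt from pruning is left open.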

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful reading and constructive feedback. We address each major comment below, indicating planned revisions to improve clarity and rigor.


Referee: [Training procedure (§3) and Experiments (§4)] The central claim that the BPE-seeded Hard EM on minimum-token paths plus single pruning produces tokenizers whose compression and morphological properties meet or exceed those of full Unigram rests on the unverified assumption that the removed components (suffix array, forward-backward, iterative pruning) are dispensable. No ablation or direct comparison to standard Unigram training is described that would confirm the resulting token probabilities and segmentations remain sufficiently close.

Authors: The manuscript directly compares the final MinGram tokenizers to standard Unigram on the key metrics of compression ratio, morphological alignment, and downstream bits-per-byte performance, with MinGram showing higher compression and competitive alignment. These end-to-end results serve as empirical validation that the simplifications preserve (and in some cases improve) the desired properties. We agree that an explicit internal comparison of token probabilities or segmentation distributions would strengthen the argument. We will add such a comparison (e.g., vocabulary overlap and average path length statistics) in the revised version. revision: yes

Referee: [Abstract and §4.3] The abstract and experimental claims state improvements 'across six languages' and 'consistently beat BPE in bits-per-byte' but supply no baselines, statistical tests, number of runs, variance estimates, or exact metric values. This makes it impossible to assess whether the reported gains are robust or load-bearing for the downstream conclusion.

Authors: We will revise the abstract and §4.3 to report exact baseline values, the number of runs (five random seeds for the language-model experiments), variance estimates, and the results of statistical significance tests (paired t-tests across seeds) to make the robustness of the gains explicit. revision: yes
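The vocabulary-overlap and path-length statistics promised in the first response could take a form like the sketch below. The choice of Jaccard overlap, the function names, and the placeholder tokenizer are assumptions for illustration; the rebuttal does not specify how either statistic would be defined.

```python
def jaccard_overlap(vocab_a, vocab_b):
    """Share of tokens common to both vocabularies (1.0 if both are empty)."""
    a, b = set(vocab_a), set(vocab_b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0

def mean_tokens_per_word(words, tokenize):
    """Average segmentation length (path length) over a word list."""
    return sum(len(tokenize(w)) for w in words) / len(words)

# Hypothetical vocabularies and a character-level placeholder tokenizer.
print(jaccard_overlap({"a", "b", "ab"}, {"a", "b", "bc"}))  # 0.5
print(mean_tokens_per_word(["ab", "abc"], list))            # 2.5
```

High overlap with similar path lengths would suggest the simplified training lands near standard Unigram; low overlap with equal end metrics would suggest a genuinely different solution.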

Circularity Check

0 steps flagged

No circularity: empirical claims rest on measured outcomes, not self-referential definitions or derivations.


The paper defines MinGram via an explicit algorithmic procedure (BPE seed vocabulary + Hard EM restricted to minimum-token paths + single flat score-pruning) and then reports measured compression ratios, morphological alignment scores, and downstream bits-per-byte on held-out data across six languages. No equations, first-principles derivations, or predictions appear; the central claims are direct experimental results that can be falsified by re-running the procedure on the same corpora. No self-citations are invoked as load-bearing uniqueness theorems, no fitted parameters are relabeled as predictions, and the method is not shown to be equivalent to its inputs by construction. The simplification's validity is an empirical question, not a definitional one.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that token count serves as a reliable primary objective for tokenizer quality and that the simplified training steps preserve Unigram advantages.

axioms (1)

Domain assumption: Minimizing the number of tokens produced is a valid primary objective for tokenizer quality, with probabilistic scores used only as tiebreakers.
The abstract explicitly states that token count is made the primary objective.


discussion (0)

Reference graph

Works this paper leans on

55 extracted references · 25 canonical work pages · 1 internal anchor

[1]

Schmidt, Varshini Reddy, Haoran Zhang, Alec Alameddine, Omri Uzan, Yuval Pinter, and Chris Tanner

Schmidt, Craig W and Reddy, Varshini and Zhang, Haoran and Alameddine, Alec and Uzan, Omri and Pinter, Yuval and Tanner, Chris. Tokenization Is More Than Compression. Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing. 2024. doi:10.18653/v1/2024.emnlp-main.40

work page doi:10.18653/v1/2024.emnlp-main.40 2024
[2]

and Tanner, Chris and Pinter, Yuval

Uzan, Omri and Schmidt, Craig W. and Tanner, Chris and Pinter, Yuval. Greed is All You Need: An Evaluation of Tokenizer Inference Methods. Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers). 2024. doi:10.18653/v1/2024.acl-short.73

work page doi:10.18653/v1/2024.acl-short.73 2024
[3]

Tokenization and the Noiseless Channel

Zouhar, Vil \'e m and Meister, Clara and Gastaldi, Juan and Du, Li and Sachan, Mrinmaya and Cotterell, Ryan. Tokenization and the Noiseless Channel. Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2023. doi:10.18653/v1/2023.acl-long.284

work page doi:10.18653/v1/2023.acl-long.284 2023
[4]

Two Counterexamples to Tokenization and the Noiseless Channel

Cognetta, Marco and Zouhar, Vil \'e m and Moon, Sangwhan and Okazaki, Naoaki. Two Counterexamples to Tokenization and the Noiseless Channel. Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024). 2024

2024
[5]

NLLB Team and Costa-juss \`a , Marta R. and Cross, James and C elebi, Onur and Elbayad, Maha and Heafield, Kenneth and Heffernan, Kevin and Kalbassi, Elahe and Lam, Janice and Licht, Daniel and Maillard, Jean and Sun, Anna and Wang, Skyler and Wenzek, Guillaume and Youngblood, Al and Akula, Bapi and Barrault, Loic and Gonzalez, Gabriel Mejia and Hansanti,...

work page doi:10.1038/s41586-024-07335-x 2024
[6]

Byte pair encoding is suboptimal for language model pretraining

Bostrom, Kaj and Durrett, Greg. Byte Pair Encoding is Suboptimal for Language Model Pretraining. Findings of the Association for Computational Linguistics: EMNLP 2020. 2020. doi:10.18653/v1/2020.findings-emnlp.414

work page doi:10.18653/v1/2020.findings-emnlp.414 2020
[7]

Incorporating Context into Subword Vocabularies

Yehezkel, Shaked and Pinter, Yuval. Incorporating Context into Subword Vocabularies. Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics. 2023. doi:10.18653/v1/2023.eacl-main.45

work page doi:10.18653/v1/2023.eacl-main.45 2023
[8]

Rethinking Tokenization for Rich Morphology: The Dominance of U nigram over BPE and Morphological Alignment

Vemula, Saketh Reddy and Dandapat, Sandipan and Sharma, Dipti and Krishnamurthy, Parameswari. Rethinking Tokenization for Rich Morphology: The Dominance of U nigram over BPE and Morphological Alignment. The 14th International Joint Conference on Natural Language Processing and The 4th Conference of the Asia-Pacific Chapter of the Association for Computati...

work page doi:10.18653/v1/2025.ijcnlp-srw.20 2025
[9]

2502.00894 , archivePrefix=

Ehsaneddin Asgari and Yassine El Kheir and Mohammad Ali Sadraei Javaheri , year=. 2502.00894 , archivePrefix=

arXiv
[10]

Which Pieces Does

Sander Land and Yuval Pinter , year=. Which Pieces Does. 2512.12641 , archivePrefix=

Pith/arXiv arXiv
[11]

2025 , publisher =

Guilherme Penedo , title =. 2025 , publisher =

2025
[12]

2020 , eprint=

BPE-Dropout: Simple and Effective Subword Regularization , author=. 2020 , eprint=

2020
[13]

Evaluating Morphological Plausibility of Subword Tokenization via Statistical Alignment with Morpho-Syntactic Features

Stephen, Abishek and Libovick \'y , Jind r ich. Evaluating Morphological Plausibility of Subword Tokenization via Statistical Alignment with Morpho-Syntactic Features. Findings of the A ssociation for C omputational L inguistics: EACL 2026. 2026. doi:10.18653/v1/2026.findings-eacl.196

work page doi:10.18653/v1/2026.findings-eacl.196 2026
[14]

2025 , eprint=

Evaluating Morphological Alignment of Tokenizers in 70 Languages , author=. 2025 , eprint=

2025
[15]

A. P. Dempster and N. M. Laird and D. B. Rubin , journal =. Maximum Likelihood from Incomplete Data via the
[16]

SentencePiece: a simple and language independent subword tokenizer and detokenizer for neural text processing

Kudo, Taku and Richardson, John. SentencePiece : A simple and language independent subword tokenizer and detokenizer for Neural Text Processing. Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations. 2018. doi:10.18653/v1/D18-2012

work page internal anchor Pith review doi:10.18653/v1/d18-2012 2018
[17]

Combinatorial Pattern Matching , pages=

Linear-Time Longest-Common-Prefix Computation in Suffix Arrays and Its Applications , author=. Combinatorial Pattern Matching , pages=. 2001 , organization=

2001
[18]

2505.24689 , archivePrefix=

Sander Land and Catherine Arnett , year=. 2505.24689 , archivePrefix=

arXiv
[19]

BPE Gets Picky: Efficient Vocabulary Refinement During Tokenizer Training

Chizhov, Pavel and Arnett, Catherine and Korotkova, Elizaveta and Yamshchikov, Ivan P. BPE Gets Picky: Efficient Vocabulary Refinement During Tokenizer Training. Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing. 2024. doi:10.18653/v1/2024.emnlp-main.925

work page doi:10.18653/v1/2024.emnlp-main.925 2024
[20]

and Bergen, Benjamin

Arnett, Catherine and Chang, Tyler A. and Bergen, Benjamin. Proceedings of the 3rd Annual Meeting of the Special Interest Group on Under-resourced Languages @ LREC-COLING 2024. 2024

2024
[21]

Neural machine translation of rare words with subword units

Sennrich, Rico and Haddow, Barry and Birch, Alexandra. Neural Machine Translation of Rare Words with Subword Units. Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2016. doi:10.18653/v1/P16-1162

work page doi:10.18653/v1/p16-1162 2016
[22]

Proceedings of the 31st International Conference on Computational Linguistics

Velayuthan, Menan and Sarveswaran, Kengatharaiyer. Proceedings of the 31st International Conference on Computational Linguistics. 2025

2025
[23]

Pre-tokenization on punctuation in

Sander Land , year=. Pre-tokenization on punctuation in
[24]

Catherine Arnett , year=
[25]

A Monolingual Approach to Contextualized Word Embeddings for Mid-Resource Languages

Ortiz Suarez, Pedro Javier and Romary, Laurent and Sagot, Benoit. A Monolingual Approach to Contextualized Word Embeddings for Mid-Resource Languages. Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. 2020

2020
[26]

Asynchronous pipelines for processing huge corpora on medium to low resource infrastructures , series =

Pedro Javier. Asynchronous pipelines for processing huge corpora on medium to low resource infrastructures , series =. 2019 , language =. doi:10.14618/ids-pub-9021 , url =

work page doi:10.14618/ids-pub-9021 2019
[27]

Rossi and Thien Huu Nguyen , year=

Thuat Nguyen and Chien Van Nguyen and Viet Dac Lai and Hieu Man and Nghia Trung Ngo and Franck Dernoncourt and Ryan A. Rossi and Thien Huu Nguyen , year=. 2309.09400 , archivePrefix=

arXiv
[28]

Jamo-Level Subword Tokenization in Low-Resource K orean Machine Translation

Lee, Junyoung and Cognetta, Marco and Moon, Sangwhan and Okazaki, Naoaki. Jamo-Level Subword Tokenization in Low-Resource K orean Machine Translation. Proceedings of the Eighth Workshop on Technologies for Machine Translation of Low-Resource Languages (LoResMT 2025). 2025

2025
[29]

doi: 10.18653/v1/2024.emnlp-main.649

Land, Sander and Bartolo, Max. Fishing for M agikarp: Automatically Detecting Under-trained Tokens in Large Language Models. Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing. 2024. doi:10.18653/v1/2024.emnlp-main.649

work page doi:10.18653/v1/2024.emnlp-main.649 2024
[30]

Rush , year=

Thomas Wolf and Lysandre Debut and Victor Sanh and Julien Chaumond and Clement Delangue and Anthony Moi and Pierric Cistac and Tim Rault and Rémi Louf and Morgan Funtowicz and Joe Davison and Sam Shleifer and Patrick von Platen and Clara Ma and Yacine Jernite and Julien Plu and Canwen Xu and Teven Le Scao and Sylvain Gugger and Mariama Drame and Quentin L...

Pith/arXiv arXiv 1910
[31]

2017 , eprint=

Tying Word Vectors and Word Classifiers: A Loss Framework for Language Modeling , author=. 2017 , eprint=

2017
[32]

Subword Regularization: Improving Neural Network Translation Models with Multiple Subword Candidates

Kudo, Taku. Subword Regularization: Improving Neural Network Translation Models with Multiple Subword Candidates. Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2018. doi:10.18653/v1/P18-1007

work page doi:10.18653/v1/p18-1007 2018
[33]

BPE -Dropout: Simple and Effective Subword Regularization

Provilkov, Ivan and Emelianenko, Dmitrii and Voita, Elena. BPE -Dropout: Simple and Effective Subword Regularization. Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. 2020. doi:10.18653/v1/2020.acl-main.170

work page doi:10.18653/v1/2020.acl-main.170 2020
[34]

Mielke and Zaid Alyafeai and Elizabeth Salesky and Colin Raffel and Manan Dey and Matthias Gallé and Arun Raja and Chenglei Si and Wilson Y

Sabrina J. Mielke and Zaid Alyafeai and Elizabeth Salesky and Colin Raffel and Manan Dey and Matthias Gallé and Arun Raja and Chenglei Si and Wilson Y. Lee and Benoît Sagot and Samson Tan , year=. Between words and characters: A Brief History of Open-Vocabulary Modeling and Tokenization in. 2112.10508 , archivePrefix=

arXiv
[35]

Too Much in Common: Shifting of Embeddings in Transformer Language Models and its Implications

Bi \'s , Daniel and Podkorytov, Maksim and Liu, Xiuwen. Too Much in Common: Shifting of Embeddings in Transformer Language Models and its Implications. Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. 2021. doi:10.18653/v1/2021.naacl-main.403

work page doi:10.18653/v1/2021.naacl-main.403 2021
[36]

2025 , url=

Alisa Liu and Jonathan Hayase and Valentin Hofmann and Sewoong Oh and Noah A Smith and Yejin Choi , booktitle=. 2025 , url=

2025
[37]

2025 , eprint=

Parity-Aware Byte-Pair Encoding: Improving Cross-lingual Fairness in Tokenization , author=. 2025 , eprint=

2025
[38]

2026 , url =

Proceedings of the Fourteenth International Conference on Learning Representations , author =. 2026 , url =

2026
[39]

Language Model Tokenizers Introduce Unfairness Between Languages , url =

Petrov, Aleksandar and La Malfa, Emanuele and Torr, Philip and Bibi, Adel , booktitle =. Language Model Tokenizers Introduce Unfairness Between Languages , url =
[40]

Do All Languages Cost the Same? T okenization in the Era of Commercial Language Models

Ahia, Orevaoghene and Kumar, Sachin and Gonen, Hila and Kasai, Jungo and Mortensen, David and Smith, Noah and Tsvetkov, Yulia. Do All Languages Cost the Same? Tokenization in the Era of Commercial Language Models. Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing. 2023. doi:10.18653/v1/2023.emnlp-main.614

work page doi:10.18653/v1/2023.emnlp-main.614 2023
[41]

Tokenizer Choice For LLM Training: Negligible or Crucial?

Ali, Mehdi and Fromm, Michael and Thellmann, Klaudia and Rutmann, Richard and L \"u bbering, Max and Leveling, Johannes and Klug, Katrin and Ebert, Jan and Doll, Niclas and Buschhoff, Jasper and Jain, Charvi and Weber, Alexander and Jurkschat, Lena and Abdelwahab, Hammam and John, Chelsea and Ortiz Suarez, Pedro and Ostendorff, Malte and Weinbach, Samuel ...

work page doi:10.18653/v1/2024.findings-naacl.247 2024
[42]

Lotz, Ant \'o nio V

Lotz, Jonas F. and Lopes, Ant \'o nio V. and Peitz, Stephan and Setiawan, Hendra and Emili, Leonardo. Beyond Text Compression: Evaluating Tokenizers Across Scales. Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2025. doi:10.18653/v1/2025.acl-long.1546

work page doi:10.18653/v1/2025.acl-long.1546 2025
[43]

Why do language models perform worse for morphologically complex languages?

Arnett, Catherine and Bergen, Benjamin. Why do language models perform worse for morphologically complex languages?. Proceedings of the 31st International Conference on Computational Linguistics. 2025

2025
[44]

doi: 10.18653/v1/2021.acl-long.243

Rust, Phillip and Pfeiffer, Jonas and Vuli \'c , Ivan and Ruder, Sebastian and Gurevych, Iryna. How Good is Your Tokenizer? On the Monolingual Performance of Multilingual Language Models. Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volum...

work page doi:10.18653/v1/2021.acl-long.243 2021
[45]

Superbizarre Is Not Superb: Derivational Morphology Improves BERT ' s Interpretation of Complex Words

Hofmann, Valentin and Pierrehumbert, Janet and Sch. Superbizarre Is Not Superb: Derivational Morphology Improves BERT ' s Interpretation of Complex Words. Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers). 2021. doi:10.1...

work page doi:10.18653/v1/2021.acl-long.279 2021
[46]

2026 , eprint=

Tokenisation via Convex Relaxations , author=. 2026 , eprint=

2026
[47]

2026 , eprint=

Goldfish: Monolingual Language Models for 350 Languages , author=. 2026 , eprint=

2026
[48]

2025 , isbn =

Lian, Haoran and Xiong, Yizhe and Niu, Jianwei and Mo, Shasha and Su, Zhenpeng and Lin, Zijia and Chen, Hui and Han, Jungong and Ding, Guiguang , title =. 2025 , isbn =. doi:10.1609/aaai.v39i23.34633 , booktitle =

work page doi:10.1609/aaai.v39i23.34633 2025
[49]

The U niversity of E dinburgh ' s Neural MT Systems for WMT 17

Sennrich, Rico and Birch, Alexandra and Currey, Anna and Germann, Ulrich and Haddow, Barry and Heafield, Kenneth and Miceli Barone, Antonio Valerio and Williams, Philip. The U niversity of E dinburgh ' s Neural MT Systems for WMT 17. Proceedings of the Second Conference on Machine Translation. 2017. doi:10.18653/v1/W17-4739

work page doi:10.18653/v1/w17-4739 2017
[50]

Second Conference on Language Modeling , year=

Boundless Byte Pair Encoding: Breaking the Pre-tokenization Barrier , author=. Second Conference on Language Modeling , year=
[51]

Investigating the Effectiveness of BPE : The Power of Shorter Sequences

Gall \'e , Matthias. Investigating the Effectiveness of BPE : The Power of Shorter Sequences. Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). 2019. doi:10.18653/v1/D19-1141

work page doi:10.18653/v1/d19-1141 2019
[52]

2026 , eprint=

Tokenization with Split Trees , author=. 2026 , eprint=

2026
[53]

Nemotron-

Shizhe Diao and Yu Yang and Yonggan Fu and Xin Dong and Dan Su and Markus Kliegl and Zijia Chen and Peter Belcak and Yoshi Suhara and Hongxu Yin and Mostofa Patwary and Yingyan Lin and Jan Kautz and Pavlo Molchanov , year=. Nemotron-. 2504.13161 , archivePrefix=

Pith/arXiv arXiv
[54]

2025 , publisher =

Andrej Karpathy , title =. 2025 , publisher =

2025
[55]

and Carmon, Yair and Dave, Achal and Schmidt, Ludwig and Shankar, Vaishaal , booktitle =

Li, Jeffrey and Fang, Alex and Smyrnis, Georgios and Ivgi, Maor and Jordan, Matt and Gadre, Samir and Bansal, Hritik and Guha, Etash and Keh, Sedrick and Arora, Kushal and Garg, Saurabh and Xin, Rui and Muennighoff, Niklas and Heckel, Reinhard and Mercat, Jean and Chen, Mayee and Gururangan, Suchin and Wortsman, Mitchell and Albalak, Alon and Bitton, Yona...

work page doi:10.52202/079017-0455

[1] [1]

Schmidt, Varshini Reddy, Haoran Zhang, Alec Alameddine, Omri Uzan, Yuval Pinter, and Chris Tanner

Schmidt, Craig W and Reddy, Varshini and Zhang, Haoran and Alameddine, Alec and Uzan, Omri and Pinter, Yuval and Tanner, Chris. Tokenization Is More Than Compression. Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing. 2024. doi:10.18653/v1/2024.emnlp-main.40

work page doi:10.18653/v1/2024.emnlp-main.40 2024

[2] [2]

and Tanner, Chris and Pinter, Yuval

Uzan, Omri and Schmidt, Craig W. and Tanner, Chris and Pinter, Yuval. Greed is All You Need: An Evaluation of Tokenizer Inference Methods. Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers). 2024. doi:10.18653/v1/2024.acl-short.73

work page doi:10.18653/v1/2024.acl-short.73 2024

[3] Zouhar, Vilém and Meister, Clara and Gastaldi, Juan and Du, Li and Sachan, Mrinmaya and Cotterell, Ryan. Tokenization and the Noiseless Channel. Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2023. doi:10.18653/v1/2023.acl-long.284

[4] Cognetta, Marco and Zouhar, Vilém and Moon, Sangwhan and Okazaki, Naoaki. Two Counterexamples to Tokenization and the Noiseless Channel. Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024). 2024.

[5] NLLB Team and Costa-jussà, Marta R. and Cross, James and Çelebi, Onur and Elbayad, Maha and Heafield, Kenneth and Heffernan, Kevin and Kalbassi, Elahe and Lam, Janice and Licht, Daniel and Maillard, Jean and Sun, Anna and Wang, Skyler and Wenzek, Guillaume and Youngblood, Al and Akula, Bapi and Barrault, Loic and Gonzalez, Gabriel Mejia and Hansanti,... 2024. doi:10.1038/s41586-024-07335-x

[6] Bostrom, Kaj and Durrett, Greg. Byte Pair Encoding is Suboptimal for Language Model Pretraining. Findings of the Association for Computational Linguistics: EMNLP 2020. 2020. doi:10.18653/v1/2020.findings-emnlp.414

[7] Yehezkel, Shaked and Pinter, Yuval. Incorporating Context into Subword Vocabularies. Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics. 2023. doi:10.18653/v1/2023.eacl-main.45

[8] Vemula, Saketh Reddy and Dandapat, Sandipan and Sharma, Dipti and Krishnamurthy, Parameswari. Rethinking Tokenization for Rich Morphology: The Dominance of Unigram over BPE and Morphological Alignment. The 14th International Joint Conference on Natural Language Processing and The 4th Conference of the Asia-Pacific Chapter of the Association for Computati... 2025. doi:10.18653/v1/2025.ijcnlp-srw.20

[9] Ehsaneddin Asgari and Yassine El Kheir and Mohammad Ali Sadraei Javaheri. arXiv:2502.00894

[10] Sander Land and Yuval Pinter. Which Pieces Does ... arXiv:2512.12641

[11] Guilherme Penedo. 2025.

[12] BPE-Dropout: Simple and Effective Subword Regularization. 2020.

[13] Stephen, Abishek and Libovický, Jindřich. Evaluating Morphological Plausibility of Subword Tokenization via Statistical Alignment with Morpho-Syntactic Features. Findings of the Association for Computational Linguistics: EACL 2026. 2026. doi:10.18653/v1/2026.findings-eacl.196

[14] Evaluating Morphological Alignment of Tokenizers in 70 Languages. 2025.

[15] A. P. Dempster and N. M. Laird and D. B. Rubin. Maximum Likelihood from Incomplete Data via the ...

[16] Kudo, Taku and Richardson, John. SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing. Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations. 2018. doi:10.18653/v1/D18-2012

[17] Linear-Time Longest-Common-Prefix Computation in Suffix Arrays and Its Applications. Combinatorial Pattern Matching. 2001.

[18] Sander Land and Catherine Arnett. arXiv:2505.24689

[19] Chizhov, Pavel and Arnett, Catherine and Korotkova, Elizaveta and Yamshchikov, Ivan P. BPE Gets Picky: Efficient Vocabulary Refinement During Tokenizer Training. Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing. 2024. doi:10.18653/v1/2024.emnlp-main.925

[20] Arnett, Catherine and Chang, Tyler A. and Bergen, Benjamin. Proceedings of the 3rd Annual Meeting of the Special Interest Group on Under-resourced Languages @ LREC-COLING 2024. 2024.

[21] Sennrich, Rico and Haddow, Barry and Birch, Alexandra. Neural Machine Translation of Rare Words with Subword Units. Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2016. doi:10.18653/v1/P16-1162

[22] Velayuthan, Menan and Sarveswaran, Kengatharaiyer. Proceedings of the 31st International Conference on Computational Linguistics. 2025.

[23] Sander Land. Pre-tokenization on punctuation in ...

[24] Catherine Arnett.

[25] Ortiz Suarez, Pedro Javier and Romary, Laurent and Sagot, Benoit. A Monolingual Approach to Contextualized Word Embeddings for Mid-Resource Languages. Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. 2020.

[26] Pedro Javier. Asynchronous pipelines for processing huge corpora on medium to low resource infrastructures. 2019. doi:10.14618/ids-pub-9021

[27] Thuat Nguyen and Chien Van Nguyen and Viet Dac Lai and Hieu Man and Nghia Trung Ngo and Franck Dernoncourt and Ryan A. Rossi and Thien Huu Nguyen. arXiv:2309.09400

[28] Lee, Junyoung and Cognetta, Marco and Moon, Sangwhan and Okazaki, Naoaki. Jamo-Level Subword Tokenization in Low-Resource Korean Machine Translation. Proceedings of the Eighth Workshop on Technologies for Machine Translation of Low-Resource Languages (LoResMT 2025). 2025.

[29] Land, Sander and Bartolo, Max. Fishing for Magikarp: Automatically Detecting Under-trained Tokens in Large Language Models. Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing. 2024. doi:10.18653/v1/2024.emnlp-main.649

[30] Thomas Wolf and Lysandre Debut and Victor Sanh and Julien Chaumond and Clement Delangue and Anthony Moi and Pierric Cistac and Tim Rault and Rémi Louf and Morgan Funtowicz and Joe Davison and Sam Shleifer and Patrick von Platen and Clara Ma and Yacine Jernite and Julien Plu and Canwen Xu and Teven Le Scao and Sylvain Gugger and Mariama Drame and Quentin L... Rush.

[31] Tying Word Vectors and Word Classifiers: A Loss Framework for Language Modeling. 2017.

[32] Kudo, Taku. Subword Regularization: Improving Neural Network Translation Models with Multiple Subword Candidates. Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2018. doi:10.18653/v1/P18-1007

[33] Provilkov, Ivan and Emelianenko, Dmitrii and Voita, Elena. BPE-Dropout: Simple and Effective Subword Regularization. Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. 2020. doi:10.18653/v1/2020.acl-main.170

[34] Sabrina J. Mielke and Zaid Alyafeai and Elizabeth Salesky and Colin Raffel and Manan Dey and Matthias Gallé and Arun Raja and Chenglei Si and Wilson Y. Lee and Benoît Sagot and Samson Tan. Between words and characters: A Brief History of Open-Vocabulary Modeling and Tokenization in ... arXiv:2112.10508

[35] Biś, Daniel and Podkorytov, Maksim and Liu, Xiuwen. Too Much in Common: Shifting of Embeddings in Transformer Language Models and its Implications. Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. 2021. doi:10.18653/v1/2021.naacl-main.403

[36] Alisa Liu and Jonathan Hayase and Valentin Hofmann and Sewoong Oh and Noah A Smith and Yejin Choi. 2025.

[37] Parity-Aware Byte-Pair Encoding: Improving Cross-lingual Fairness in Tokenization. 2025.

[38] Proceedings of the Fourteenth International Conference on Learning Representations. 2026.

[39] Petrov, Aleksandar and La Malfa, Emanuele and Torr, Philip and Bibi, Adel. Language Model Tokenizers Introduce Unfairness Between Languages.

[40] Ahia, Orevaoghene and Kumar, Sachin and Gonen, Hila and Kasai, Jungo and Mortensen, David and Smith, Noah and Tsvetkov, Yulia. Do All Languages Cost the Same? Tokenization in the Era of Commercial Language Models. Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing. 2023. doi:10.18653/v1/2023.emnlp-main.614

[41] Ali, Mehdi and Fromm, Michael and Thellmann, Klaudia and Rutmann, Richard and Lübbering, Max and Leveling, Johannes and Klug, Katrin and Ebert, Jan and Doll, Niclas and Buschhoff, Jasper and Jain, Charvi and Weber, Alexander and Jurkschat, Lena and Abdelwahab, Hammam and John, Chelsea and Ortiz Suarez, Pedro and Ostendorff, Malte and Weinbach, Samuel ... Tokenizer Choice For LLM Training: Negligible or Crucial? 2024. doi:10.18653/v1/2024.findings-naacl.247

[42] Lotz, Jonas F. and Lopes, António V. and Peitz, Stephan and Setiawan, Hendra and Emili, Leonardo. Beyond Text Compression: Evaluating Tokenizers Across Scales. Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2025. doi:10.18653/v1/2025.acl-long.1546

[43] Arnett, Catherine and Bergen, Benjamin. Why do language models perform worse for morphologically complex languages? Proceedings of the 31st International Conference on Computational Linguistics. 2025.

[44] Rust, Phillip and Pfeiffer, Jonas and Vulić, Ivan and Ruder, Sebastian and Gurevych, Iryna. How Good is Your Tokenizer? On the Monolingual Performance of Multilingual Language Models. Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volum... 2021. doi:10.18653/v1/2021.acl-long.243

[45] Hofmann, Valentin and Pierrehumbert, Janet and Sch. Superbizarre Is Not Superb: Derivational Morphology Improves BERT's Interpretation of Complex Words. Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers). 2021. doi:10.18653/v1/2021.acl-long.279

[46] Tokenisation via Convex Relaxations. 2026.

[47] Goldfish: Monolingual Language Models for 350 Languages. 2026.

[48] Lian, Haoran and Xiong, Yizhe and Niu, Jianwei and Mo, Shasha and Su, Zhenpeng and Lin, Zijia and Chen, Hui and Han, Jungong and Ding, Guiguang. 2025. doi:10.1609/aaai.v39i23.34633

[49] Sennrich, Rico and Birch, Alexandra and Currey, Anna and Germann, Ulrich and Haddow, Barry and Heafield, Kenneth and Miceli Barone, Antonio Valerio and Williams, Philip. The University of Edinburgh's Neural MT Systems for WMT17. Proceedings of the Second Conference on Machine Translation. 2017. doi:10.18653/v1/W17-4739

[50] Boundless Byte Pair Encoding: Breaking the Pre-tokenization Barrier. Second Conference on Language Modeling.

[51] Gallé, Matthias. Investigating the Effectiveness of BPE: The Power of Shorter Sequences. Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). 2019. doi:10.18653/v1/D19-1141

[52] Tokenization with Split Trees. 2026.

[53] Shizhe Diao and Yu Yang and Yonggan Fu and Xin Dong and Dan Su and Markus Kliegl and Zijia Chen and Peter Belcak and Yoshi Suhara and Hongxu Yin and Mostofa Patwary and Yingyan Lin and Jan Kautz and Pavlo Molchanov. Nemotron-... arXiv:2504.13161

[54] Andrej Karpathy. 2025.

[55] Li, Jeffrey and Fang, Alex and Smyrnis, Georgios and Ivgi, Maor and Jordan, Matt and Gadre, Samir and Bansal, Hritik and Guha, Etash and Keh, Sedrick and Arora, Kushal and Garg, Saurabh and Xin, Rui and Muennighoff, Niklas and Heckel, Reinhard and Mercat, Jean and Chen, Mayee and Gururangan, Suchin and Wortsman, Mitchell and Albalak, Alon and Bitton, Yona... and Carmon, Yair and Dave, Achal and Schmidt, Ludwig and Shankar, Vaishaal. doi:10.52202/079017-0455