FLEXITOKENS: Flexible Tokenization for Evolving Language Models

Abraham Toluwase Owodunni; Orevaoghene Ahia; Sachin Kumar

arxiv: 2507.12720 · v4 · pith:57GQC52Nnew · submitted 2025-07-17 · 💻 cs.CL


Pith reviewed 2026-05-19 05:07 UTC · model grok-4.3

classification 💻 cs.CL

keywords flexible tokenization · adaptive tokenizers · byte-level language models · boundary prediction · language model adaptation · tokenizer-free methods · multilingual benchmarks · over-fragmentation


The pith

A simplified training objective lets byte-level language models learn flexible token boundaries that adapt to new domains and languages without fixed compression constraints.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Language models are hard to adapt because fixed subword tokenizers over-fragment text from new distributions, languages, or scripts. The work builds byte-level models that include a submodule to predict variable-length token boundaries directly from byte sequences. Prior tokenizer-free methods add an auxiliary loss to enforce a constant compression rate across the corpus, which creates its own rigidity. FLEXITOKENS replaces this with a simpler objective that removes the fixed-rate constraint, allowing the boundary predictor to learn segmentations better suited to the current data. Across multilingual benchmarks, morphologically rich tasks, and domain shifts, the approach reduces over-fragmentation and delivers gains of up to 10 percentage points on token classification and generative tasks compared with BPE and gradient-based tokenizer baselines, with gains holding across model scales.

Core claim

By training the boundary predictor with a simplified objective that omits the auxiliary fixed-compression loss, byte-level language models acquire tokenizations that adapt to the input distribution rather than remaining locked to a preset compression rate, yielding lower over-fragmentation and higher downstream accuracy on out-of-distribution data.
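To make the terms in this claim concrete, the following is a minimal sketch of how per-byte boundary decisions turn a byte sequence into variable-length segments, and how a compression rate (bytes per segment) can be read off the result. The function names and the hand-written boundary mask are illustrative assumptions, not code or outputs from the paper.

```python
from typing import List


def segment_bytes(data: bytes, boundaries: List[int]) -> List[bytes]:
    """Split `data` into segments; boundaries[i] == 1 ends a segment after byte i."""
    if len(data) != len(boundaries):
        raise ValueError("need exactly one boundary decision per byte")
    segments, start = [], 0
    for i, is_boundary in enumerate(boundaries):
        if is_boundary:
            segments.append(data[start:i + 1])
            start = i + 1
    if start < len(data):  # trailing bytes without a closing boundary
        segments.append(data[start:])
    return segments


def compression_rate(data: bytes, boundaries: List[int]) -> float:
    """Bytes per segment: higher means fewer, longer tokens."""
    segments = segment_bytes(data, boundaries)
    return len(data) / len(segments) if segments else 0.0


text = "tokenization".encode("utf-8")
# Hypothetical boundary decisions, written by hand for illustration only.
mask = [0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1]
print(segment_bytes(text, mask))     # [b'token', b'ization']
print(compression_rate(text, mask))  # 6.0
```

Under this reading, over-fragmentation corresponds to a low compression rate on a given input: many short segments for the same number of bytes.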

What carries the argument

The FLEXITOKENS simplified training objective for the byte-sequence boundary predictor, which learns to insert variable-length token boundaries without an auxiliary term that enforces fixed corpus-wide compression.

If this is right

Token over-fragmentation decreases on out-of-distribution domains and morphologically diverse languages.
Downstream token classification and generative tasks improve by as much as 10 percentage points relative to BPE and gradient-based tokenizer baselines.
Performance gains remain consistent when the same method is applied to models of different sizes.
Adaptation to new data distributions becomes possible through finetuning alone without manual tokenizer replacement.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Models could continue learning new segmentation strategies as they encounter fresh data streams over time, supporting longer-term evolution without periodic tokenizer retraining.
The removal of the fixed-compression auxiliary term may simplify extension to low-resource languages or mixed-script text that current fixed tokenizers handle poorly.
Similar simplification of auxiliary objectives could be tested in other adaptive components such as dynamic vocabulary growth or on-the-fly embedding updates.

Load-bearing premise

The boundary predictor trained with the simplified objective will still produce segmentations useful for downstream performance rather than collapsing to trivial or degenerate boundaries.

What would settle it

A controlled experiment on new domains or unseen scripts in which the FLEXITOKENS model shows no reduction in over-fragmentation and no accuracy gains over BPE or fixed-compression baselines would falsify the central claim.

Figures

Figures reproduced from arXiv: 2507.12720 by Abraham Toluwase Owodunni, Orevaoghene Ahia, Sachin Kumar.

**Figure 2.** FineWeb Test BPB (↓), compression rate (↑), and compression variance (↑) of FLEXITOKENS compared to the binomial variant with αA = 0.3 and λ = 3. Higher compression rates result in fewer tokens, which in turn leads to a more efficient model. Overall, the FLEXITOKENS 1B model achieves the best score across all metrics.

**Figure 3.** Average number of tokens per sample obtained in the FLORES dataset with different tokenization algorithms. FLEXITOKENS consistently produces the least number of tokens while maintaining balance across languages, even for the unseen language Urdu. BPE over-fragments seen (Hindi, Telugu) as well as unseen languages (Urdu). We also observe a higher variance in compression rates of FLEXITOKENS implying higher…

**Figure 4.** Compression rate changes with FLEXITOKENS across multiple tasks. Initial is the base compression rate before pretraining. Compression rate for binomial remains relatively low while we also see a spike for tasks like XNLI… [Plot: bits per byte by language (en, es, ru, uk, hi, te).]

**Figure 5.** FineWeb Test results for ablating the number of layers in…

**Figure 6.** Number of training documents sampled by language.
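The figures report bits per byte (BPB), a metric that normalizes language modeling loss by the number of bytes rather than the number of tokens, so models with different segmentations can be compared. The sketch below shows the standard conversion from a summed negative log-likelihood in nats to BPB. The function name and the example numbers are illustrative assumptions, not values from the paper.

```python
import math


def bits_per_byte(total_nll_nats: float, num_bytes: int) -> float:
    """Convert a summed negative log-likelihood (in nats) to bits per byte."""
    if num_bytes <= 0:
        raise ValueError("num_bytes must be positive")
    # Dividing by ln(2) converts nats to bits; dividing by the byte count
    # makes the score independent of how the text was segmented.
    return total_nll_nats / (math.log(2) * num_bytes)


# Hypothetical example: a loss of exactly one bit per byte over 1000 bytes.
example_nll = 1000 * math.log(2)
print(bits_per_byte(example_nll, 1000))
```

Because the denominator counts bytes, a model that emits fewer tokens gets no automatic advantage in BPB, which is why the figures report compression rate as a separate quantity.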

Original abstract

Adapting language models to new data distributions by simple finetuning is challenging. This is due to the rigidity of their subword tokenizers, which typically remain unchanged during adaptation. This inflexibility often leads to inefficient tokenization, causing overfragmentation of text in out-of-distribution domains, unseen languages, or scripts. In this work, we develop byte-level LMs with learnable tokenizers to make tokenization adaptive. Our models include a submodule that learns to predict boundaries given the input byte sequence, encoding it into variable-length segments. Most tokenizer-free methods train this boundary predictor using an auxiliary loss that enforces a fixed compression rate across the training corpus, introducing a new kind of rigidity. We propose FLEXITOKENS, a simplified training objective that enables significantly greater flexibility during adaptation. Evaluating across multiple multilingual benchmarks, morphologically diverse tasks, and domains, we demonstrate that FLEXITOKENS consistently reduces token over-fragmentation and achieves up to 10% point improvements on token classification and generative tasks compared to BPE and other gradient-based tokenizer baselines. We validate our findings using models of varying sizes, and our method demonstrates consistent improvements across scales. Code and data for our experiments will be released at https://github.com/skai-research/flexitokens

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

FLEXITOKENS drops the auxiliary compression loss for learnable byte boundaries and claims downstream gains, but the abstract gives no evidence the predictor avoids trivial fixed policies.


The main thing here is that FLEXITOKENS removes the fixed-compression auxiliary loss that earlier learnable tokenizer work used to train the boundary predictor. They keep the predictor but train it only on the downstream task loss, and they report that this produces less over-fragmentation plus gains of up to 10 points on token classification and generation across multilingual and domain-shift benchmarks. The consistency across model sizes is the part that looks most useful on first read.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces FLEXITOKENS, a byte-level language model with a learnable boundary predictor for variable-length tokenization. It proposes a simplified training objective that removes the auxiliary fixed-compression loss common in prior tokenizer-free approaches, claiming this yields greater flexibility during adaptation to new distributions, languages, or domains. The authors report that the method reduces token over-fragmentation and delivers up to 10 percentage point gains on token classification and generative tasks relative to BPE and gradient-based tokenizer baselines, with consistent improvements across model scales and multilingual/morphologically diverse benchmarks. Code and data release is promised.

Significance. If the central empirical claims are substantiated, the work could meaningfully advance adaptive tokenization for evolving language models by addressing subword rigidity in OOD settings. The simplification of the objective to remove an auxiliary constraint is a clear conceptual contribution. Credit is due for the cross-scale validation and the commitment to public code and data release, which directly supports reproducibility in this area.

major comments (2)

[Method] Method section (description of the simplified objective): the boundary predictor is trained solely on the main task loss without the auxiliary compression term. No analysis or ablation is provided to demonstrate that this does not lead to degenerate fixed policies (always-emit or never-emit boundaries), which would directly undermine the claimed reduction in over-fragmentation and downstream gains. This is load-bearing for the headline performance claims.
[Experiments] Experiments section: the abstract states consistent improvements and up to 10-point gains, yet the reported results lack details on statistical significance testing, precise baseline re-implementations, data splits, or explicit ablations isolating the effect of removing the auxiliary loss. These omissions prevent full verification of the cross-scale and cross-task claims.
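The degenerate policies named in the first major comment can be screened for with a simple diagnostic on predicted boundary masks. The sketch below labels a set of masks by their mean boundary rate. The function names and the tolerance `tol` are assumptions chosen for illustration, and the masks are hand-written rather than model outputs.

```python
from typing import List, Sequence


def boundary_rate(mask: Sequence[int]) -> float:
    """Fraction of byte positions at which a boundary is emitted."""
    return sum(mask) / len(mask) if mask else 0.0


def classify_policy(masks: List[Sequence[int]], tol: float = 0.02) -> str:
    """Label a set of predicted masks by their mean boundary rate.

    `tol` is an arbitrary illustrative threshold, not a value from the paper.
    """
    rates = [boundary_rate(m) for m in masks if m]
    if not rates:
        raise ValueError("no non-empty masks to inspect")
    mean_rate = sum(rates) / len(rates)
    if mean_rate >= 1.0 - tol:
        return "always-emit"  # every byte becomes its own token
    if mean_rate <= tol:
        return "never-emit"   # each sequence collapses to a single token
    return "non-degenerate"


print(classify_policy([[1, 1, 1, 1], [1, 1, 1]]))     # always-emit
print(classify_policy([[0, 0, 0, 0], [0, 0, 0]]))     # never-emit
print(classify_policy([[0, 0, 1, 0], [0, 1, 0, 1]]))  # non-degenerate
```

A mean rate alone would not flag a fixed-stride policy (for example, a boundary every four bytes regardless of content), so comparing the variance of per-sample rates across inputs is a natural further check.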

minor comments (1)

[Abstract] The abstract would be clearer if it briefly noted the range of model sizes used for the scale-consistency validation.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback and for highlighting areas where additional evidence would strengthen the manuscript. We address each major comment below and commit to revisions that directly respond to the concerns raised.


Referee: [Method] Method section (description of the simplified objective): the boundary predictor is trained solely on the main task loss without the auxiliary compression term. No analysis or ablation is provided to demonstrate that this does not lead to degenerate fixed policies (always-emit or never-emit boundaries), which would directly undermine the claimed reduction in over-fragmentation and downstream gains. This is load-bearing for the headline performance claims.

Authors: We agree that an explicit demonstration that the simplified objective avoids degenerate boundary policies is important for substantiating the core claims. In the current manuscript we relied on the fact that the boundary predictor is optimized jointly with the language modeling loss, which should penalize policies that produce uninformative segmentations (as these would increase perplexity). However, this argument is indirect. To address the referee's point directly, we will add an ablation in the revised manuscript that compares the learned boundary distributions against forced always-emit and never-emit baselines, along with the resulting tokenization statistics and downstream performance. This analysis will be placed in the method or experiments section.

Revision: yes

Referee: [Experiments] Experiments section: the abstract states consistent improvements and up to 10-point gains, yet the reported results lack details on statistical significance testing, precise baseline re-implementations, data splits, or explicit ablations isolating the effect of removing the auxiliary loss. These omissions prevent full verification of the cross-scale and cross-task claims.

Authors: We acknowledge that the current experimental reporting is insufficient for full verification. In the revision we will expand the experiments section to include: (i) statistical significance testing (multiple random seeds with reported means, standard deviations, and p-values where appropriate), (ii) precise descriptions of baseline re-implementations including all hyperparameters and training details, (iii) explicit documentation of data splits and preprocessing, and (iv) a dedicated ablation that isolates the contribution of removing the auxiliary compression loss. These additions will allow readers to reproduce and verify the cross-scale and cross-task results.

Revision: yes
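The multi-seed reporting promised in item (i) could take a form like the sketch below, which summarizes per-seed scores as a mean and sample standard deviation for each system. The system names and scores are placeholders invented for illustration, not results from the paper, and significance testing is left out of this sketch.

```python
import statistics
from typing import Dict, List


def summarize(scores_by_system: Dict[str, List[float]]) -> Dict[str, dict]:
    """Return mean, sample standard deviation, and run count per system."""
    summary = {}
    for name, scores in scores_by_system.items():
        if len(scores) < 2:
            raise ValueError(f"{name}: need at least two seeds for a std dev")
        summary[name] = {
            "mean": statistics.mean(scores),
            "std": statistics.stdev(scores),  # sample (n - 1) standard deviation
            "n": len(scores),
        }
    return summary


# Hypothetical per-seed accuracies; placeholders, not numbers from the paper.
runs = {
    "system_a": [70.0, 72.0, 74.0],
    "system_b": [65.0, 66.0, 67.0],
}
for name, stats in summarize(runs).items():
    print(f"{name}: {stats['mean']:.1f} ± {stats['std']:.1f} (n={stats['n']})")
```

Reporting the run count next to the spread lets a reader judge how much weight a gap between two systems can bear.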

Circularity Check

0 steps flagged

No circularity: empirical proposal with independent experimental validation


The paper introduces FLEXITOKENS as a new simplified training objective for a byte-level boundary predictor, removing the auxiliary fixed-compression loss used in prior tokenizer-free methods. All performance claims (reduced over-fragmentation, up to 10-point gains on classification and generation tasks) are supported by direct empirical evaluation on multilingual benchmarks, morphologically diverse tasks, and domain shifts across model scales. No equations, derivations, or fitted parameters are presented as 'predictions' that reduce to the inputs by construction. No self-citations, uniqueness theorems, or ansatzes from prior author work are invoked as load-bearing justification. The central result is therefore an empirical training change whose validity rests on external benchmark outcomes rather than any self-referential reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on standard assumptions of gradient-based optimization for sequence models and the premise that byte sequences contain sufficient signal for useful boundary prediction; no new free parameters or invented entities are introduced beyond the boundary predictor submodule itself.

axioms (1)

domain assumption: Byte sequences provide a complete and sufficient representation for learning token boundaries in any script or language.
Implicit foundation for all byte-level modeling in the work.


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/AlphaCoordinateFixation.lean · alpha_pin_under_high_calibration · unclear
We propose FLEXITOKENS, a simplified training objective... L_BP = max(k/N − α, 0) + max(β − k/N, 0)

IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · absolute_floor_iff_bare_distinguishability · unclear
boundary predictor... hard Gumbel sigmoid re-parameterization
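The two quoted passages can be illustrated with a short sketch. Read literally, the quoted expression L_BP = max(k/N − α, 0) + max(β − k/N, 0) is zero whenever β ≤ k/N ≤ α and grows linearly outside that band. The code below assumes k is the number of predicted boundaries and N the sequence length in bytes, which the excerpt does not state, and the α and β values are hypothetical. The sampler shows only the forward pass of a hard Gumbel-sigmoid; the straight-through gradient used in differentiable training needs an autodiff framework and is omitted.

```python
import math
import random


def boundary_loss(k: int, n: int, alpha: float, beta: float) -> float:
    """Quoted L_BP, assuming k = boundary count and n = sequence length."""
    if n <= 0:
        raise ValueError("n must be positive")
    rate = k / n
    return max(rate - alpha, 0.0) + max(beta - rate, 0.0)


def hard_gumbel_sigmoid(logit: float, tau: float = 1.0, rng=random) -> int:
    """Forward pass only: sample a binary boundary decision from a logit."""
    u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)  # keep the logs finite
    noise = math.log(u) - math.log(1.0 - u)         # logistic noise
    soft = 1.0 / (1.0 + math.exp(-(logit + noise) / tau))
    return 1 if soft > 0.5 else 0


# Hypothetical bounds, chosen only to exercise the three regimes.
print(boundary_loss(k=3, n=10, alpha=0.5, beta=0.2))  # inside the band
print(boundary_loss(k=8, n=10, alpha=0.5, beta=0.2))  # too many boundaries
print(boundary_loss(k=1, n=10, alpha=0.5, beta=0.2))  # too few boundaries
print([hard_gumbel_sigmoid(0.0) for _ in range(8)])   # random 0/1 decisions
```

The contrast with a fixed-rate penalty is that any boundary rate inside the band incurs no loss, so the segmentation there is shaped by the rest of the training objective.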

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Scratchpad Patching: Decoupling Compute from Patch Size in Byte-Level Language Models
cs.CL · 2026-05 · conditional · novelty 7.0

Scratchpad Patching decouples compute from patch size in byte-level language models by inserting entropy-triggered scratchpads to update patch context dynamically.

Compute Optimal Tokenization
cs.CL · 2026-05 · unverdicted · novelty 6.0

Compute-optimal language models require parameter count to scale with data bytes rather than tokens, with optimal token compression rate decreasing as compute budget grows.

Reference graph

Works this paper leans on

50 extracted references · 50 canonical work pages · cited by 2 Pith papers · 4 internal anchors

[1]

Tokenizer choice for llm training: Negligible or crucial? In Findings of the Association for Computational Linguistics: NAACL 2024, pages 3907–3924, 2024

Mehdi Ali, Michael Fromm, Klaudia Thellmann, Richard Rutmann, Max Lübbering, Johannes Leveling, Katrin Klug, Jan Ebert, Niclas Doll, Jasper Buschhoff, et al. Tokenizer choice for llm training: Negligible or crucial? In Findings of the Association for Computational Linguistics: NAACL 2024, pages 3907–3924, 2024

work page 2024
[2]

Coercing llms to do and reveal (almost) anything

Jonas Geiping, Alex Stein, Manli Shu, Khalid Saifullah, Yuxin Wen, and Tom Goldstein. Coercing llms to do and reveal (almost) anything.arXiv preprint arXiv:2402.14020, 2024

work page arXiv 2024
[3]

arXiv preprint arXiv:2405.05417 , year=

Sander Land and Max Bartolo. Fishing for magikarp: Automatically detecting under-trained tokens in large language models.arXiv preprint arXiv:2405.05417, 2024

work page arXiv 2024
[4]

Neural machine translation of rare words with subword units

Rico Sennrich, Barry Haddow, and Alexandra Birch. Neural machine translation of rare words with subword units. In Katrin Erk and Noah A. Smith, editors, Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1715–1725, Berlin, Germany, August 2016. Association for Computational Linguistics. d...

work page doi:10.18653/v1/p16-1162 2016
[5]

Bert: Pre-training of deep bidirectional transformers for language understanding

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. InProceedings of the 2019 conference of the North American chapter of the association for computational linguistics: human language technologies, volume 1 (long and short papers) , pages 4171–4186, 2019

work page 2019
[6]

Jacovi, A., Caciularu, A., Goldman, O., and Goldberg, Y

Orevaoghene Ahia, Sachin Kumar, Hila Gonen, Jungo Kasai, David Mortensen, Noah Smith, and Yulia Tsvetkov. Do all languages cost the same? tokenization in the era of commercial language models. In Houda Bouamor, Juan Pino, and Kalika Bali, editors,Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 9904–9923, Singa...

work page doi:10.18653/v1/2023.emnlp-main 2023
[7]

URL https://aclanthology.org/2023.emnlp-main.614/

work page 2023
[8]

Language model tokenizers introduce unfairness between languages.Advances in neural information processing systems, 36: 36963–36990, 2023

Aleksandar Petrov, Emanuele La Malfa, Philip Torr, and Adel Bibi. Language model tokenizers introduce unfairness between languages.Advances in neural information processing systems, 36: 36963–36990, 2023

work page 2023
[9]

Getting the most out of your tokenizer for pre-training and domain adaptation

Gautier Dagan, Gabriel Synnaeve, and Baptiste Rozière. Getting the most out of your tokenizer for pre-training and domain adaptation. InProceedings of the 41st International Conference on Machine Learning, ICML’24. JMLR.org, 2024. 10

work page 2024
[10]

Zero-shot tokenizer transfer.arXiv preprint arXiv:2405.07883, 2024

Benjamin Minixhofer, Edoardo Maria Ponti, and Ivan Vulić. Zero-shot tokenizer transfer.arXiv preprint arXiv:2405.07883, 2024

work page arXiv 2024
[11]

Li, J.; Huang, S.; Ching, A.; Dai, X.; and Chen, J

Haonan Li, Fajri Koto, Minghao Wu, Alham Fikri Aji, and Timothy Baldwin. Bactrian-x: Multilingual replicable instruction-following models with low-rank adaptation.arXiv preprint arXiv:2305.15011, 2023

work page arXiv 2023
[12]

ByT5: Towards a Token-Free Future with Pre-trained Byte-to-Byte Models

Linting Xue, Aditya Barua, Noah Constant, Rami Al-Rfou, Sharan Narang, Mihir Kale, Adam Roberts, and Colin Raffel. ByT5: Towards a token-free future with pre-trained byte-to-byte models. Transactions of the Association for Computational Linguistics, 10:291–306, 2022. doi: 10.1162/tacl_a_00461. URL https://aclanthology.org/2022.tacl-1.17/

work page doi:10.1162/tacl_a_00461 2022
[13]

Character-level language modeling with deeper self-attention

Rami Al-Rfou, Dokook Choe, Noah Constant, Mandy Guo, and Llion Jones. Character-level language modeling with deeper self-attention. InAAAI Conference on Artificial Intelligence, 2018. URL https://api.semanticscholar.org/CorpusID:52004855

work page 2018
[14]

Mambabyte: Token-free selective state space model

Junxiong Wang, Tushaar Gangavarapu, Jing Nathan Yan, and Alexander M Rush. Mambabyte: Token-free selective state space model. InFirst Conference on Language Modeling, 2024. URL https://openreview.net/forum?id=X1xNsuKssb

work page 2024
[15]

Charformer: Fast character transformers via gradient-based subword tokenization.arXiv preprint arXiv:2106.12672, 2021

Yi Tay, Vinh Q Tran, Sebastian Ruder, Jai Gupta, Hyung Won Chung, Dara Bahri, Zhen Qin, Simon Baumgartner, Cong Yu, and Donald Metzler. Charformer: Fast character transformers via gradient-based subword tokenization.arXiv preprint arXiv:2106.12672, 2021

work page arXiv 2021
[16]

37 Wenyue Hua, Lizhou Fan, Lingyao Li, Kai Mei, Jianchao Ji, Yingqiang Ge, Libby Hemphill, and Yongfeng Zhang

Piotr Nawrot, Szymon Tworkowski, Michał Tyrolski, Lukasz Kaiser, Yuhuai Wu, Christian Szegedy, and Henryk Michalewski. Hierarchical transformers are more efficient language models. In Marine Carpuat, Marie-Catherine de Marneffe, and Ivan Vladimir Meza Ruiz, editors,Find- ings of the Association for Computational Linguistics: NAACL 2022 , pages 1559–1571, ...

work page doi:10.18653/v1/2022 2022
[17]

Magnet: Improving the multilingual fairness of language models with adaptive gradient-based tokenization.Advances in Neural Information Processing Systems, 37: 47790–47814, 2024

Orevaoghene Ahia, Sachin Kumar, Hila Gonen, Valentin Hofmann, Tomasz Limisiewicz, Yulia Tsvetkov, and Noah A Smith. Magnet: Improving the multilingual fairness of language models with adaptive gradient-based tokenization.Advances in Neural Information Processing Systems, 37: 47790–47814, 2024

work page 2024
[18]

arXiv preprint arXiv:2412.09871 , year=

Artidoro Pagnoni, Ram Pasunuru, Pedro Rodriguez, John Nguyen, Benjamin Muller, Margaret Li, Chunting Zhou, Lili Yu, Jason Weston, Luke Zettlemoyer, Gargi Ghosh, Mike Lewis, Ari Holtzman, and Srinivasan Iyer. Byte latent transformer: Patches scale better than tokens, 2024. URL https://arxiv.org/abs/2412.09871

work page arXiv 2024
[19]

Nawrot, J

Piotr Nawrot, Jan Chorowski, Adrian Lancucki, and Edoardo Maria Ponti. Efficient transformers with dynamic token pooling. In Anna Rogers, Jordan Boyd-Graber, and Naoaki Okazaki, editors, Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages 6403–6417, Toronto, Canada, July 2023. Association...

work page doi:10.18653/v1/2023.acl-long.353 2023
[20]

MEGABYTE: Predicting million-byte sequences with multiscale transformers

LILI YU, Daniel Simig, Colin Flaherty, Armen Aghajanyan, Luke Zettlemoyer, and Mike Lewis. MEGABYTE: Predicting million-byte sequences with multiscale transformers. InThirty-seventh Conference on Neural Information Processing Systems, 2023. URL https://openreview. net/forum?id=JTmO2V9Xpz

work page 2023
[21]

The fineweb datasets: Decanting the web for the finest text data at scale.Advances in Neural Information Processing Systems, 37:30811–30849, 2024

Guilherme Penedo, Hynek Kydlíček, Anton Lozhkov, Margaret Mitchell, Colin A Raffel, Leandro Von Werra, Thomas Wolf, et al. The fineweb datasets: Decanting the web for the finest text data at scale.Advances in Neural Information Processing Systems, 37:30811–30849, 2024

work page 2024
[22]

Fineweb2: A sparkling update with 1000s of languages, December 2024

Guilherme Penedo, Hynek Kydlíček, Vinko Sabolčec, Bettina Messmer, Negar Foroutan, Martin Jaggi, Leandro von Werra, and Thomas Wolf. Fineweb2: A sparkling update with 1000s of languages, December 2024. URL https://huggingface.co/datasets/ HuggingFaceFW/fineweb-2. 11

work page 2024
[23]

XNLI: Evaluating Cross-lingual Sentence Representations

Alexis Conneau, Guillaume Lample, Ruty Rinott, Adina Williams, Samuel R Bowman, Holger Schwenk, and Veselin Stoyanov. Xnli: Evaluating cross-lingual sentence representations.arXiv preprint arXiv:1809.05053, 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018
[24]

Alabi, Yanke Mao, Haonan Gao, and Annie En-Shiun Lee

David Ifeoluwa Adelani, Hannah Liu, Xiaoyu Shen, Nikita Vassilyev, Jesujoba O Alabi, Yanke Mao, Haonan Gao, and Annie En-Shiun Lee. Sib-200: A simple, inclusive, and big evaluation dataset for topic classification in 200+ languages and dialects.arXiv preprint arXiv:2309.07445, 2023

work page arXiv 2023
[25]

Multilingualsentiment: A multilingual sentiment classification dataset, 2024

clapAI. Multilingualsentiment: A multilingual sentiment classification dataset, 2024. URL https://huggingface.co/datasets/clapAI/MultiLingualSentiment

work page 2024
[26]

In: Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Xiaoman Pan, Boliang Zhang, Jonathan May, Joel Nothman, Kevin Knight, and Heng Ji. Cross- lingual name tagging and linking for 282 languages. InProceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages 1946–1958, Vancouver, Canada, July 2017. Association for Computational Linguistics. doi: 10.1...

work page doi:10.18653/v1/p17-1178 1946
[27]

Association for Computational Linguistics

Marcos Zampieri, Preslav Nakov, Nikola Ljubešić, Jörg Tiedemann, Shervin Malmasi, and Ahmed Ali, editors.Proceedings of the Fifth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial 2018), Santa Fe, New Mexico, USA, August 2018. Association for Computational Linguistics. URL https://aclanthology.org/W18-3900/

work page 2018
[28]

Evaluating unsupervised text classification: zero-shot and similarity-based approaches

Tim Schopf, Daniel Braun, and Florian Matthes. Evaluating unsupervised text classification: zero-shot and similarity-based approaches. InProceedings of the 2022 6th International Conference on Natural Language Processing and Information Retrieval, pages 6–15, 2022

work page 2022
[29]

Wlv at semeval-2018 task 3: Dissecting tweets in search of irony

Omid Rohanian, Shiva Taslimipoor, Richard Evans, and Ruslan Mitkov. Wlv at semeval-2018 task 3: Dissecting tweets in search of irony. In Proceedings of The 12th International Workshop on Semantic Evaluation, pages 553–559, 2018

work page 2018
[30]

No Language Left Behind: Scaling Human-Centered Machine Translation

Marta R Costa-Jussà, James Cross, Onur Çelebi, Maha Elbayad, Kenneth Heafield, Kevin Hef- fernan, Elahe Kalbassi, Janice Lam, Daniel Licht, Jean Maillard, et al. No language left behind: Scaling human-centered machine translation.arXiv preprint arXiv:2207.04672, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[31]

Adam: A Method for Stochastic Optimization

Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014

work page internal anchor Pith review Pith/arXiv arXiv 2014
[32]

Generating Sequences With Recurrent Neural Networks

Alex Graves. Generating sequences with recurrent neural networks. arXiv preprint arXiv:1308.0850, 2013

work page internal anchor Pith review Pith/arXiv arXiv 2013
[33]

Scaling laws with vocabulary: Larger models deserve larger vocabularies, 2024

Chaofan Tao, Qian Liu, Longxu Dou, Niklas Muennighoff, Zhongwei Wan, Ping Luo, Min Lin, and Ngai Wong. Scaling laws with vocabulary: Larger models deserve larger vocabularies, 2024. URL https://arxiv.org/abs/2407.13623

work page arXiv 2024
[34]

In: Zong, C., Xia, F., Li, W., Navigli, R

Tomasz Limisiewicz, Terra Blevins, Hila Gonen, Orevaoghene Ahia, and Luke Zettlemoyer. MYTE: Morphology-driven byte encoding for better and fairer multilingual language modeling. In Lun-Wei Ku, Andre Martins, and Vivek Srikumar, editors,Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages ...

work page doi:10.18653/v1/ 2024
[35]

Yifan Li, Yifan Du, Kun Zhou, Jinpeng Wang, Xin Zhao, and Ji-Rong Wen

Jonas Lotz, Elizabeth Salesky, Phillip Rust, and Desmond Elliott. Text rendering strategies for pixel language models. In Houda Bouamor, Juan Pino, and Kalika Bali, editors,Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing , pages 10155–10172, Singapore, December 2023. Association for Computational Linguistics. doi: 10...

work page doi:10.18653/v1/2023 2023
[36]

Lotz, Emanuele Bugliarello, Elizabeth Salesky, Miryam de Lhoneux, and Desmond Elliott

Phillip Rust, Jonas F. Lotz, Emanuele Bugliarello, Elizabeth Salesky, Miryam de Lhoneux, and Desmond Elliott. Language modelling with pixels. InThe Eleventh International Confer- ence on Learning Representations, 2023. URL https://openreview.net/forum?id= FkSp8VW8RjH. 12

work page 2023
[37]

Multilingual pixel representations for translation and effective cross-lingual transfer

Elizabeth Salesky, Neha Verma, Philipp Koehn, and Matt Post. Multilingual pixel representations for translation and effective cross-lingual transfer. In Houda Bouamor, Juan Pino, and Kalika Bali, editors, Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 13845–13861, Singapore, December 2023. Association for Com...

work page doi:10.18653/v1/2023.emnlp-main.854 2023
[38]

and Garrette, Dan and Turc, Iulia and Wieting, John

Jonathan H. Clark, Dan Garrette, Iulia Turc, and John Wieting. Canine: Pre-training an efficient tokenization-free encoder for language representation. Transactions of the Associ- ation for Computational Linguistics , 10:73–91, 2022. doi: 10.1162/tacl_a_00448. URL https://aclanthology.org/2022.tacl-1.5/

work page doi:10.1162/tacl_a_00448 2022
[39]

MANTa: Efficient gradient-based tokenization for end-to-end robust language modeling

Nathan Godey, Roman Castagné, Éric de la Clergerie, and Benoît Sagot. MANTa: Efficient gradient-based tokenization for end-to-end robust language modeling. In Yoav Goldberg, Zornitsa Kozareva, and Yue Zhang, editors, Findings of the Association for Computational Linguistics: EMNLP 2022, pages 2859–2870, Abu Dhabi, United Arab Emirates, December 2022. Asso...

work page doi:10.18653/v1/2022.findings-emnlp.207 2022
[40] Yi Tay, Vinh Q. Tran, Sebastian Ruder, Jai Gupta, Hyung Won Chung, Dara Bahri, Zhen Qin, Simon Baumgartner, Cong Yu, and Donald Metzler. Charformer: Fast character transformers via gradient-based subword tokenization. In International Conference on Learning Representations. URL https://openreview.net/forum?id=JtBRnrlOEFN

[42] Md Mofijul Islam, Gustavo Aguilar, Pragaash Ponnusamy, Clint Solomon Mathialagan, Chengyuan Ma, and Chenlei Guo. A vocabulary-free multilingual neural tokenizer for end-to-end task learning. In Spandana Gella, He He, Bodhisattwa Prasad Majumder, Burcu Can, Eleonora Giunchiglia, Samuel Cahyawijaya, Sewon Min, Maximilian Mozes, Xiang Lorraine Li, Isabelle A... doi: 10.18653/v1/2022.repl4nlp-1.10.

[43] Aaditya K. Singh and DJ Strouse. Tokenization counts: the impact of tokenization on arithmetic in frontier llms, 2024. URL https://arxiv.org/abs/2402.14903

[44] Ashutosh Sathe, Divyanshu Aggarwal, and Sunayana Sitaram. Improving consistency in LLM inference using probabilistic tokenization. In Luis Chiruzzo, Alan Ritter, and Lu Wang, editors, Findings of the Association for Computational Linguistics: NAACL 2025, pages 4766–4778, Albuquerque, New Mexico, April 2025. Association for Computational Linguistics. ISBN...

[45] Chanjun Park, Sugyeong Eo, Hyeonseok Moon, and Heuiseok Lim. Should we find another model?: Improving neural machine translation performance with ONE-piece tokenization method without model modification. In Young-bum Kim, Yunyao Li, and Owen Rambow, editors, Proceedings of the 2021 Conference of the North American Chapter of the Association for Computatio... doi: 10.18653/v1/2021.naacl-industry.13.

[46] Jesujoba O. Alabi, David Ifeoluwa Adelani, Marius Mosbach, and Dietrich Klakow. Adapting pre-trained language models to African languages via multilingual adaptive fine-tuning. In Nicoletta Calzolari, Chu-Ren Huang, Hansaem Kim, James Pustejovsky, Leo Wanner, Key-Sun Choi, Pum-Mo Ryu, Hsin-Hsi Chen, Lucia Donatelli, Heng Ji, Sadao Kurohashi, Patrizia Pag...

[47] Benjamin Minixhofer, Fabian Paischer, and Navid Rekabsaz. WECHSEL: Effective initialization of subword embeddings for cross-lingual transfer of monolingual language models. In Marine Carpuat, Marie-Catherine de Marneffe, and Ivan Vladimir Meza Ruiz, editors, Proceedings of the 2022 Conference of the North American Chapter of the Association for Computa... doi: 10.18653/v1/2022.naacl-main.293.

[48] Vin Sachidananda, Jason Kessler, and Yi-An Lai. Efficient domain adaptation of language models via adaptive tokenization. In Nafise Sadat Moosavi, Iryna Gurevych, Angela Fan, Thomas Wolf, Yufang Hou, Ana Marasović, and Sujith Ravi, editors, Proceedings of the Second Workshop on Simple and Efficient Natural Language Processing, pages 155–165, Virtual, Nove... doi: 10.18653/v1/2021.sustainlp-1.16.

[49] Siyang Liu, Naihao Deng, Sahand Sabour, Yilin Jia, Minlie Huang, and Rada Mihalcea. Task-adaptive tokenization: Enhancing long-form text generation efficacy in mental health and beyond. In Houda Bouamor, Juan Pino, and Kalika Bali, editors, Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 15264–15281, Singapore... Association for Computational Linguistics. doi: 10.18653/v1/2023.emnlp-main.944. URL https://aclanthology.org/2023.emnlp-main.944/.

Appendix A Limitations

Our limited computational budget prevents us from training larger models with more languages on larger datasets. We anticipate the results will improve with scaling, potentially providing even higher c...

[1] Mehdi Ali, Michael Fromm, Klaudia Thellmann, Richard Rutmann, Max Lübbering, Johannes Leveling, Katrin Klug, Jan Ebert, Niclas Doll, Jasper Buschhoff, et al. Tokenizer choice for llm training: Negligible or crucial? In Findings of the Association for Computational Linguistics: NAACL 2024, pages 3907–3924, 2024.

[2] Jonas Geiping, Alex Stein, Manli Shu, Khalid Saifullah, Yuxin Wen, and Tom Goldstein. Coercing llms to do and reveal (almost) anything. arXiv preprint arXiv:2402.14020, 2024.

[3] Sander Land and Max Bartolo. Fishing for magikarp: Automatically detecting under-trained tokens in large language models. arXiv preprint arXiv:2405.05417, 2024.

[4] Rico Sennrich, Barry Haddow, and Alexandra Birch. Neural machine translation of rare words with subword units. In Katrin Erk and Noah A. Smith, editors, Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1715–1725, Berlin, Germany, August 2016. Association for Computational Linguistics. doi: 10.18653/v1/p16-1162...

[5] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 conference of the North American chapter of the association for computational linguistics: human language technologies, volume 1 (long and short papers), pages 4171–4186, 2019.

[6] Orevaoghene Ahia, Sachin Kumar, Hila Gonen, Jungo Kasai, David Mortensen, Noah Smith, and Yulia Tsvetkov. Do all languages cost the same? tokenization in the era of commercial language models. In Houda Bouamor, Juan Pino, and Kalika Bali, editors, Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 9904–9923, Singa... URL https://aclanthology.org/2023.emnlp-main.614/

[8] Aleksandar Petrov, Emanuele La Malfa, Philip Torr, and Adel Bibi. Language model tokenizers introduce unfairness between languages. Advances in Neural Information Processing Systems, 36:36963–36990, 2023.

[9] Gautier Dagan, Gabriel Synnaeve, and Baptiste Rozière. Getting the most out of your tokenizer for pre-training and domain adaptation. In Proceedings of the 41st International Conference on Machine Learning, ICML’24. JMLR.org, 2024.

[10] Benjamin Minixhofer, Edoardo Maria Ponti, and Ivan Vulić. Zero-shot tokenizer transfer. arXiv preprint arXiv:2405.07883, 2024.

[11] Haonan Li, Fajri Koto, Minghao Wu, Alham Fikri Aji, and Timothy Baldwin. Bactrian-x: Multilingual replicable instruction-following models with low-rank adaptation. arXiv preprint arXiv:2305.15011, 2023.

[12] Linting Xue, Aditya Barua, Noah Constant, Rami Al-Rfou, Sharan Narang, Mihir Kale, Adam Roberts, and Colin Raffel. ByT5: Towards a token-free future with pre-trained byte-to-byte models. Transactions of the Association for Computational Linguistics, 10:291–306, 2022. doi: 10.1162/tacl_a_00461. URL https://aclanthology.org/2022.tacl-1.17/

[13] Rami Al-Rfou, Dokook Choe, Noah Constant, Mandy Guo, and Llion Jones. Character-level language modeling with deeper self-attention. In AAAI Conference on Artificial Intelligence, 2018. URL https://api.semanticscholar.org/CorpusID:52004855

[14] Junxiong Wang, Tushaar Gangavarapu, Jing Nathan Yan, and Alexander M Rush. Mambabyte: Token-free selective state space model. In First Conference on Language Modeling, 2024. URL https://openreview.net/forum?id=X1xNsuKssb

[15] Yi Tay, Vinh Q Tran, Sebastian Ruder, Jai Gupta, Hyung Won Chung, Dara Bahri, Zhen Qin, Simon Baumgartner, Cong Yu, and Donald Metzler. Charformer: Fast character transformers via gradient-based subword tokenization. arXiv preprint arXiv:2106.12672, 2021.

[16] Piotr Nawrot, Szymon Tworkowski, Michał Tyrolski, Lukasz Kaiser, Yuhuai Wu, Christian Szegedy, and Henryk Michalewski. Hierarchical transformers are more efficient language models. In Marine Carpuat, Marie-Catherine de Marneffe, and Ivan Vladimir Meza Ruiz, editors, Findings of the Association for Computational Linguistics: NAACL 2022, pages 1559–1571, ...

[17] Orevaoghene Ahia, Sachin Kumar, Hila Gonen, Valentin Hofmann, Tomasz Limisiewicz, Yulia Tsvetkov, and Noah A Smith. Magnet: Improving the multilingual fairness of language models with adaptive gradient-based tokenization. Advances in Neural Information Processing Systems, 37:47790–47814, 2024.

[18] Artidoro Pagnoni, Ram Pasunuru, Pedro Rodriguez, John Nguyen, Benjamin Muller, Margaret Li, Chunting Zhou, Lili Yu, Jason Weston, Luke Zettlemoyer, Gargi Ghosh, Mike Lewis, Ari Holtzman, and Srinivasan Iyer. Byte latent transformer: Patches scale better than tokens, 2024. URL https://arxiv.org/abs/2412.09871

[19] Piotr Nawrot, Jan Chorowski, Adrian Lancucki, and Edoardo Maria Ponti. Efficient transformers with dynamic token pooling. In Anna Rogers, Jordan Boyd-Graber, and Naoaki Okazaki, editors, Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 6403–6417, Toronto, Canada, July 2023. Association... doi: 10.18653/v1/2023.acl-long.353.

[20] Lili Yu, Daniel Simig, Colin Flaherty, Armen Aghajanyan, Luke Zettlemoyer, and Mike Lewis. MEGABYTE: Predicting million-byte sequences with multiscale transformers. In Thirty-seventh Conference on Neural Information Processing Systems, 2023. URL https://openreview.net/forum?id=JTmO2V9Xpz

[21] Guilherme Penedo, Hynek Kydlíček, Anton Lozhkov, Margaret Mitchell, Colin A Raffel, Leandro Von Werra, Thomas Wolf, et al. The fineweb datasets: Decanting the web for the finest text data at scale. Advances in Neural Information Processing Systems, 37:30811–30849, 2024.

[22] Guilherme Penedo, Hynek Kydlíček, Vinko Sabolčec, Bettina Messmer, Negar Foroutan, Martin Jaggi, Leandro von Werra, and Thomas Wolf. Fineweb2: A sparkling update with 1000s of languages, December 2024. URL https://huggingface.co/datasets/HuggingFaceFW/fineweb-2.

[23] Alexis Conneau, Guillaume Lample, Ruty Rinott, Adina Williams, Samuel R Bowman, Holger Schwenk, and Veselin Stoyanov. Xnli: Evaluating cross-lingual sentence representations. arXiv preprint arXiv:1809.05053, 2018.

[24] David Ifeoluwa Adelani, Hannah Liu, Xiaoyu Shen, Nikita Vassilyev, Jesujoba O Alabi, Yanke Mao, Haonan Gao, and Annie En-Shiun Lee. Sib-200: A simple, inclusive, and big evaluation dataset for topic classification in 200+ languages and dialects. arXiv preprint arXiv:2309.07445, 2023.

[25] clapAI. Multilingualsentiment: A multilingual sentiment classification dataset, 2024. URL https://huggingface.co/datasets/clapAI/MultiLingualSentiment

[26] Xiaoman Pan, Boliang Zhang, Jonathan May, Joel Nothman, Kevin Knight, and Heng Ji. Cross-lingual name tagging and linking for 282 languages. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1946–1958, Vancouver, Canada, July 2017. Association for Computational Linguistics. doi: 10.18653/v1/p17-1178...

[27] Marcos Zampieri, Preslav Nakov, Nikola Ljubešić, Jörg Tiedemann, Shervin Malmasi, and Ahmed Ali, editors. Proceedings of the Fifth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial 2018), Santa Fe, New Mexico, USA, August 2018. Association for Computational Linguistics. URL https://aclanthology.org/W18-3900/

[28] Tim Schopf, Daniel Braun, and Florian Matthes. Evaluating unsupervised text classification: zero-shot and similarity-based approaches. In Proceedings of the 2022 6th International Conference on Natural Language Processing and Information Retrieval, pages 6–15, 2022.

[29] Omid Rohanian, Shiva Taslimipoor, Richard Evans, and Ruslan Mitkov. Wlv at semeval-2018 task 3: Dissecting tweets in search of irony. In Proceedings of The 12th International Workshop on Semantic Evaluation, pages 553–559, 2018.

[30] Marta R Costa-Jussà, James Cross, Onur Çelebi, Maha Elbayad, Kenneth Heafield, Kevin Heffernan, Elahe Kalbassi, Janice Lam, Daniel Licht, Jean Maillard, et al. No language left behind: Scaling human-centered machine translation. arXiv preprint arXiv:2207.04672, 2022.

[31] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

[32] Alex Graves. Generating sequences with recurrent neural networks. arXiv preprint arXiv:1308.0850, 2013.

[33] Chaofan Tao, Qian Liu, Longxu Dou, Niklas Muennighoff, Zhongwei Wan, Ping Luo, Min Lin, and Ngai Wong. Scaling laws with vocabulary: Larger models deserve larger vocabularies, 2024. URL https://arxiv.org/abs/2407.13623

[34] Tomasz Limisiewicz, Terra Blevins, Hila Gonen, Orevaoghene Ahia, and Luke Zettlemoyer. MYTE: Morphology-driven byte encoding for better and fairer multilingual language modeling. In Lun-Wei Ku, Andre Martins, and Vivek Srikumar, editors, Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages ...

[35] Jonas Lotz, Elizabeth Salesky, Phillip Rust, and Desmond Elliott. Text rendering strategies for pixel language models. In Houda Bouamor, Juan Pino, and Kalika Bali, editors, Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 10155–10172, Singapore, December 2023. Association for Computational Linguistics. doi: 10...
