arxiv: 2507.12720 · v4 · pith:57GQC52Nnew · submitted 2025-07-17 · 💻 cs.CL

FLEXITOKENS: Flexible Tokenization for Evolving Language Models

Pith reviewed 2026-05-19 05:07 UTC · model grok-4.3

classification 💻 cs.CL
keywords flexible tokenization · adaptive tokenizers · byte-level language models · boundary prediction · language model adaptation · tokenizer-free methods · multilingual benchmarks · over-fragmentation

The pith

A simplified training objective lets byte-level language models learn flexible token boundaries that adapt to new domains and languages without fixed compression constraints.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Language models face adaptation challenges because fixed subword tokenizers cause over-fragmentation when encountering new distributions, languages, or scripts. The work builds byte-level models that include a submodule to predict variable-length token boundaries directly from byte sequences. Prior tokenizer-free methods add an auxiliary loss to enforce a constant compression rate across the corpus, which creates its own rigidity. FLEXITOKENS replaces this with a simpler objective that removes the fixed-rate constraint, allowing the boundary predictor to learn more suitable segmentations for the current data. Across multilingual benchmarks, morphologically rich tasks, and domain shifts, the approach reduces over-fragmentation and delivers up to 10 percentage point gains on both classification and generation compared with BPE and other gradient-based tokenizers, with gains holding across model scales.
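
To make "compression rate" and "over-fragmentation" concrete, the following is a minimal sketch of how a per-byte boundary mask turns a byte sequence into variable-length segments, and how the average bytes per segment can be measured. The function names `segment_bytes` and `compression_rate` and the convention that a flag of 1 closes a segment are assumptions made for illustration; they are not taken from the paper or its code release.

```python
from typing import List


def segment_bytes(data: bytes, boundaries: List[int]) -> List[bytes]:
    """Split `data` after every position i where boundaries[i] == 1.

    Hypothetical convention: one 0/1 flag per byte, 1 means "a segment ends here".
    """
    if len(boundaries) != len(data):
        raise ValueError("need exactly one boundary flag per byte")
    segments, start = [], 0
    for i, flag in enumerate(boundaries):
        if flag == 1:
            segments.append(data[start:i + 1])
            start = i + 1
    if start < len(data):  # trailing bytes with no closing boundary
        segments.append(data[start:])
    return segments


def compression_rate(data: bytes, boundaries: List[int]) -> float:
    """Average number of bytes per segment (higher means fewer segments)."""
    segments = segment_bytes(data, boundaries)
    return len(data) / len(segments) if segments else 0.0


if __name__ == "__main__":
    text = b"hello world"
    word_like = [0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1]  # boundaries after "hello", " ", "world"
    per_byte = [1] * len(text)                      # every byte is its own segment
    print(segment_bytes(text, word_like))
    print(compression_rate(text, word_like), compression_rate(text, per_byte))
```

In this toy setting, the per-byte mask is the extreme case of over-fragmentation: the same input is spread over as many segments as it has bytes, so the downstream model processes a much longer sequence.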

Core claim

By training the boundary predictor with a simplified objective that omits the auxiliary fixed-compression loss, byte-level language models acquire tokenizations that adapt to the input distribution rather than remaining locked to a preset compression rate, yielding lower over-fragmentation and higher downstream accuracy on out-of-distribution data.

What carries the argument

The FLEXITOKENS simplified training objective for the byte-sequence boundary predictor, which learns to insert variable-length token boundaries without an auxiliary term that enforces fixed corpus-wide compression.
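
The difference between a rigid and a relaxed constraint on boundary counts can be sketched numerically. The first function below is a binomial negative log-likelihood, one way a fixed target boundary rate can be encouraged (Figure 2 mentions a "binomial variant" as the comparison point). The second function, `one_sided_penalty`, is a purely hypothetical relaxation written for this illustration; it is not the paper's objective, which this page does not reproduce. The names `k` (number of predicted boundaries), `n` (number of bytes), and `alpha` (target boundary rate) are assumptions.

```python
import math


def binomial_nll(k: int, n: int, alpha: float) -> float:
    """Negative log-likelihood of k boundaries among n bytes under Binomial(n, alpha).

    Penalizes boundary counts both above and below the target alpha * n.
    """
    log_coeff = math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
    return -(log_coeff + k * math.log(alpha) + (n - k) * math.log(1.0 - alpha))


def one_sided_penalty(k: int, n: int, alpha: float) -> float:
    """Hypothetical relaxed penalty: zero unless the boundary rate exceeds alpha."""
    return max(k / n - alpha, 0.0)


if __name__ == "__main__":
    n, alpha = 100, 0.3
    for k in (10, 30, 60):
        print(k, round(binomial_nll(k, n, alpha), 3), round(one_sided_penalty(k, n, alpha), 3))
```

Under the two-sided term, a sequence that could be segmented with far fewer boundaries than the target is still pushed back toward the target rate. Under the one-sided illustration, such a sequence incurs no penalty, which is the kind of freedom the page attributes to the simplified objective.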

If this is right

  • Token over-fragmentation decreases on out-of-distribution domains and morphologically diverse languages.
  • Downstream token classification and generative tasks improve by as much as 10 percentage points relative to BPE and other gradient-based tokenizer baselines.
  • Performance gains remain consistent when the same method is applied to models of different sizes.
  • Adaptation to new data distributions becomes possible through finetuning alone without manual tokenizer replacement.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Models could continue learning new segmentation strategies as they encounter fresh data streams over time, supporting longer-term evolution without periodic tokenizer retraining.
  • The removal of the fixed-compression auxiliary term may simplify extension to low-resource languages or mixed-script text that current fixed tokenizers handle poorly.
  • Similar simplification of auxiliary objectives could be tested in other adaptive components such as dynamic vocabulary growth or on-the-fly embedding updates.

Load-bearing premise

The boundary predictor trained with the simplified objective will still produce segmentations useful for downstream performance rather than collapsing to trivial or degenerate boundaries.
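
A simple diagnostic for this premise is to inspect the boundary rates a trained predictor actually emits. The sketch below flags a batch whose boundary masks are all close to "never emit" or all close to "always emit". The helper names and the tolerance `eps` are hypothetical choices for illustration, not values from the paper.

```python
from typing import Sequence


def boundary_rate(boundaries: Sequence[int]) -> float:
    """Fraction of byte positions flagged as a boundary."""
    if len(boundaries) == 0:
        raise ValueError("empty boundary mask")
    return sum(boundaries) / len(boundaries)


def is_degenerate(batch: Sequence[Sequence[int]], eps: float = 0.02) -> bool:
    """True if every mask in the batch is within eps of rate 0, or every mask is within eps of rate 1.

    `eps` is an arbitrary illustrative tolerance.
    """
    rates = [boundary_rate(mask) for mask in batch]
    never_emit = all(r <= eps for r in rates)
    always_emit = all(r >= 1.0 - eps for r in rates)
    return never_emit or always_emit


if __name__ == "__main__":
    print(is_degenerate([[1] * 50, [1] * 40]))     # every byte is a boundary
    print(is_degenerate([[0, 0, 1, 0, 0, 1]]))     # mixed segmentation
```

A check like this only rules out the two trivial policies; it says nothing about whether non-trivial boundaries are linguistically or statistically useful, which still has to be judged from downstream results.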

What would settle it

A controlled experiment on new domains or unseen scripts in which the FLEXITOKENS model shows no reduction in over-fragmentation and no accuracy gains over BPE or fixed-compression baselines would falsify the central claim.

Figures

Figures reproduced from arXiv: 2507.12720 by Abraham Toluwase Owodunni, Orevaoghene Ahia, Sachin Kumar.

Figure 1: We present an example of tokenized medical text, where …
Figure 2: FineWeb Test BPB (↓), Compression rate (↑) and Compression variance (↑) of FLEXITOKENS compared to the binomial variant with αA = 0.3 and λ = 3. Higher compression rates result in fewer tokens, which in turn leads to a more efficient model. Overall, FLEXITOKENS 1B model achieves the best score across all metrics. [Footnote 7] We use a shorter sequence length during pretraining due to computational constraints. [Footnote 8] Note th…
Figure 3: Average number of tokens per sample obtained in the FLORES dataset with different tokenization algorithms. FLEXITOKENS consistently produces the least number of tokens while maintaining balance across languages, even for the unseen language Urdu. BPE over-fragments seen (Hindi, Telugu) as well as unseen languages (Urdu). We also observe a higher variance in compression rates of FLEXITOKENS implying higher…
Figure 4: Compression rate changes with FLEXITOKENS across multiple tasks. Initial is the base compression rate before pretraining. Compression rate for binomial remains relatively low while we also see a spike for task like XNLI … [Plot residue omitted: x-axis "Language" (en, es, ru, uk, hi, te), y-axis "Bits per Bytes".]
Figure 5: FineWeb Test results for ablating the number layers in …
Figure 6: Number of training documents sampled by language …

Original abstract

Adapting language models to new data distributions by simple finetuning is challenging. This is due to the rigidity of their subword tokenizers, which typically remain unchanged during adaptation. This inflexibility often leads to inefficient tokenization, causing overfragmentation of text in out-of-distribution domains, unseen languages, or scripts. In this work, we develop byte-level LMs with learnable tokenizers to make tokenization adaptive. Our models include a submodule that learns to predict boundaries given the input byte sequence, encoding it into variable-length segments. Most tokenizer-free methods train this boundary predictor using an auxiliary loss that enforces a fixed compression rate across the training corpus, introducing a new kind of rigidity. We propose FLEXITOKENS, a simplified training objective that enables significantly greater flexibility during adaptation. Evaluating across multiple multilingual benchmarks, morphologically diverse tasks, and domains, we demonstrate that FLEXITOKENS consistently reduces token over-fragmentation and achieves up to 10% point improvements on token classification and generative tasks compared to BPE and other gradient-based tokenizer baselines. We validate our findings using models of varying sizes, and our method demonstrates consistent improvements across scales. Code and data for our experiments will be released at https://github.com/skai-research/flexitokens

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces FLEXITOKENS, a byte-level language model with a learnable boundary predictor for variable-length tokenization. It proposes a simplified training objective that removes the auxiliary fixed-compression loss common in prior tokenizer-free approaches, claiming this yields greater flexibility during adaptation to new distributions, languages, or domains. The authors report that the method reduces token over-fragmentation and delivers up to 10 percentage point gains on token classification and generative tasks relative to BPE and gradient-based tokenizer baselines, with consistent improvements across model scales and multilingual/morphologically diverse benchmarks. Code and data release is promised.

Significance. If the central empirical claims are substantiated, the work could meaningfully advance adaptive tokenization for evolving language models by addressing subword rigidity in OOD settings. The simplification of the objective to remove an auxiliary constraint is a clear conceptual contribution. Credit is due for the cross-scale validation and the commitment to public code and data release, which directly supports reproducibility in this area.

major comments (2)
  1. [Method] Method section (description of the simplified objective): the boundary predictor is trained solely on the main task loss without the auxiliary compression term. No analysis or ablation is provided to demonstrate that this does not lead to degenerate fixed policies (always-emit or never-emit boundaries), which would directly undermine the claimed reduction in over-fragmentation and downstream gains. This is load-bearing for the headline performance claims.
  2. [Experiments] Experiments section: the abstract states consistent improvements and up to 10-point gains, yet the reported results lack details on statistical significance testing, precise baseline re-implementations, data splits, or explicit ablations isolating the effect of removing the auxiliary loss. These omissions prevent full verification of the cross-scale and cross-task claims.
minor comments (1)
  1. [Abstract] The abstract would be clearer if it briefly noted the range of model sizes used for the scale-consistency validation.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback and for highlighting areas where additional evidence would strengthen the manuscript. We address each major comment below and commit to revisions that directly respond to the concerns raised.

Point-by-point responses
  1. Referee: [Method] Method section (description of the simplified objective): the boundary predictor is trained solely on the main task loss without the auxiliary compression term. No analysis or ablation is provided to demonstrate that this does not lead to degenerate fixed policies (always-emit or never-emit boundaries), which would directly undermine the claimed reduction in over-fragmentation and downstream gains. This is load-bearing for the headline performance claims.

    Authors: We agree that an explicit demonstration that the simplified objective avoids degenerate boundary policies is important for substantiating the core claims. In the current manuscript we relied on the fact that the boundary predictor is optimized jointly with the language modeling loss, which should penalize policies that produce uninformative segmentations (as these would increase perplexity). However, this argument is indirect. To address the referee's point directly, we will add an ablation in the revised manuscript that compares the learned boundary distributions against forced always-emit and never-emit baselines, along with the resulting tokenization statistics and downstream performance. This analysis will be placed in the method or experiments section.
    Revision: yes

  2. Referee: [Experiments] Experiments section: the abstract states consistent improvements and up to 10-point gains, yet the reported results lack details on statistical significance testing, precise baseline re-implementations, data splits, or explicit ablations isolating the effect of removing the auxiliary loss. These omissions prevent full verification of the cross-scale and cross-task claims.

    Authors: We acknowledge that the current experimental reporting is insufficient for full verification. In the revision we will expand the experiments section to include: (i) statistical significance testing (multiple random seeds with reported means, standard deviations, and p-values where appropriate), (ii) precise descriptions of baseline re-implementations including all hyperparameters and training details, (iii) explicit documentation of data splits and preprocessing, and (iv) a dedicated ablation that isolates the contribution of removing the auxiliary compression loss. These additions will allow readers to reproduce and verify the cross-scale and cross-task results.
    Revision: yes
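
As an illustration of the reporting promised in item (i), the sketch below computes a mean and sample standard deviation across seeds and runs a paired bootstrap over per-example scores of two systems. It uses only the Python standard library. The function names, the number of resamples, and all input values are made up for illustration; nothing here comes from the paper's experiments.

```python
import random
import statistics
from typing import List, Tuple


def summarize(scores: List[float]) -> Tuple[float, float]:
    """Mean and sample standard deviation of a metric across seeds (needs >= 2 values)."""
    return statistics.mean(scores), statistics.stdev(scores)


def paired_bootstrap(a: List[float], b: List[float],
                     n_resamples: int = 2000, seed: int = 0) -> float:
    """Fraction of resamples in which system A's total score does not exceed system B's.

    `a` and `b` hold per-example scores on the same test examples, in the same order.
    A small value suggests A's advantage is unlikely to be a sampling artifact.
    """
    if len(a) != len(b) or not a:
        raise ValueError("need two non-empty, equally long score lists")
    rng = random.Random(seed)
    n = len(a)
    not_better = 0
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]  # resample examples with replacement
        if sum(a[i] for i in idx) <= sum(b[i] for i in idx):
            not_better += 1
    return not_better / n_resamples


if __name__ == "__main__":
    # Placeholder numbers only, not results from the paper.
    print(summarize([71.2, 70.8, 71.9]))
    system_a = [1, 1, 0, 1, 1, 1, 0, 1, 1, 1]
    system_b = [1, 0, 0, 1, 0, 1, 0, 1, 1, 0]
    print(paired_bootstrap(system_a, system_b))
```

Pairing matters here because both systems are evaluated on the same examples; resampling example indices jointly keeps that pairing intact, unlike comparing two independently resampled means.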

Circularity Check

0 steps flagged

No circularity: empirical proposal with independent experimental validation

The paper introduces FLEXITOKENS as a new simplified training objective for a byte-level boundary predictor, removing the auxiliary fixed-compression loss used in prior tokenizer-free methods. All performance claims (reduced over-fragmentation, up to 10-point gains on classification and generation tasks) are supported by direct empirical evaluation on multilingual benchmarks, morphologically diverse tasks, and domain shifts across model scales. No equations, derivations, or fitted parameters are presented as 'predictions' that reduce to the inputs by construction. No self-citations, uniqueness theorems, or ansatzes from prior author work are invoked as load-bearing justification. The central result is therefore an empirical training change whose validity rests on external benchmark outcomes rather than any self-referential reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on standard assumptions of gradient-based optimization for sequence models and the premise that byte sequences contain sufficient signal for useful boundary prediction; no new free parameters or invented entities are introduced beyond the boundary predictor submodule itself.

axioms (1)
  • [domain assumption] Byte sequences provide a complete and sufficient representation for learning token boundaries in any script or language.
    Implicit foundation for all byte-level modeling in the work.

pith-pipeline@v0.9.0 · 5761 in / 1343 out tokens · 51639 ms · 2026-05-19T05:07:26.857776+00:00 · methodology

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Scratchpad Patching: Decoupling Compute from Patch Size in Byte-Level Language Models

    cs.CL 2026-05 conditional novelty 7.0

    Scratchpad Patching decouples compute from patch size in byte-level language models by inserting entropy-triggered scratchpads to update patch context dynamically.

  2. Compute Optimal Tokenization

    cs.CL 2026-05 unverdicted novelty 6.0

    Compute-optimal language models require parameter count to scale with data bytes rather than tokens, with optimal token compression rate decreasing as compute budget grows.

Appendix A Limitations

Our limited computational budget prevents us from training larger models with more language on larger datasets. We anticipate the results will improve with scaling potentially providing even higher c...