FLEXITOKENS: Flexible Tokenization for Evolving Language Models
Pith reviewed 2026-05-19 05:07 UTC · model grok-4.3
The pith
A simplified training objective lets byte-level language models learn flexible token boundaries that adapt to new domains and languages without fixed compression constraints.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By training the boundary predictor with a simplified objective that omits the auxiliary fixed-compression loss, byte-level language models acquire tokenizations that adapt to the input distribution rather than remaining locked to a preset compression rate, yielding lower over-fragmentation and higher downstream accuracy on out-of-distribution data.
What carries the argument
The FLEXITOKENS simplified training objective for the byte-sequence boundary predictor, which learns to insert variable-length token boundaries without an auxiliary term that enforces fixed corpus-wide compression.
If this is right
- Token over-fragmentation decreases on out-of-distribution domains and morphologically diverse languages.
- Downstream token classification and generative tasks improve by as much as 10 percentage points relative to BPE and other gradient-based tokenizer baselines.
- Performance gains remain consistent when the same method is applied to models of different sizes.
- Adaptation to new data distributions becomes possible through finetuning alone without manual tokenizer replacement.
Where Pith is reading between the lines
- Models could continue learning new segmentation strategies as they encounter fresh data streams over time, supporting longer-term evolution without periodic tokenizer retraining.
- The removal of the fixed-compression auxiliary term may simplify extension to low-resource languages or mixed-script text that current fixed tokenizers handle poorly.
- Similar simplification of auxiliary objectives could be tested in other adaptive components such as dynamic vocabulary growth or on-the-fly embedding updates.
Load-bearing premise
The boundary predictor trained with the simplified objective will still produce segmentations useful for downstream performance rather than collapsing to trivial or degenerate boundaries.
What would settle it
A controlled experiment on new domains or unseen scripts in which the FLEXITOKENS model shows no reduction in over-fragmentation and no accuracy gains over BPE or fixed-compression baselines would falsify the central claim.
Original abstract
Adapting language models to new data distributions by simple finetuning is challenging. This is due to the rigidity of their subword tokenizers, which typically remain unchanged during adaptation. This inflexibility often leads to inefficient tokenization, causing over-fragmentation of text in out-of-distribution domains, unseen languages, or scripts. In this work, we develop byte-level LMs with learnable tokenizers to make tokenization adaptive. Our models include a submodule that learns to predict boundaries given the input byte sequence, encoding it into variable-length segments. Most tokenizer-free methods train this boundary predictor using an auxiliary loss that enforces a fixed compression rate across the training corpus, introducing a new kind of rigidity. We propose FLEXITOKENS, a simplified training objective that enables significantly greater flexibility during adaptation. Evaluating across multiple multilingual benchmarks, morphologically diverse tasks, and domains, we demonstrate that FLEXITOKENS consistently reduces token over-fragmentation and achieves improvements of up to 10 percentage points on token classification and generative tasks compared to BPE and other gradient-based tokenizer baselines. We validate our findings using models of varying sizes, and our method demonstrates consistent improvements across scales. Code and data for our experiments will be released at https://github.com/skai-research/flexitokens
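The boundary-prediction step described in the abstract can be illustrated with a small, framework-free sketch. It assumes per-byte boundary logits are already available (here they are hand-written placeholders, not outputs of the paper's model), draws a hard Gumbel-sigmoid sample of the kind mentioned in a passage quoted further down this page, and cuts the byte sequence after each sampled boundary. The function names, the temperature, and the convention that a boundary closes the current segment are assumptions made for illustration. Only the forward pass is shown; a trainable version would need an autodiff framework and a straight-through gradient.

```python
import math
import random


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def hard_gumbel_sigmoid(logit, temperature=1.0, rng=random):
    """Sample a hard 0/1 boundary decision from a logit (forward pass only).

    Adds Logistic(0, 1) noise, which is the difference of two Gumbel
    samples, applies a tempered sigmoid, and thresholds at 0.5.
    """
    u = min(max(rng.random(), 1e-9), 1.0 - 1e-9)  # keep log() finite
    noise = math.log(u) - math.log(1.0 - u)
    soft = sigmoid((logit + noise) / temperature)
    return 1 if soft > 0.5 else 0


def segment_bytes(data, boundaries):
    """Split `data` so that each 1 in `boundaries` closes a segment."""
    if len(data) != len(boundaries):
        raise ValueError("need one boundary decision per byte")
    segments, start = [], 0
    for i, b in enumerate(boundaries):
        if b:
            segments.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        segments.append(data[start:])  # trailing bytes form a final segment
    return segments


text = "flexible tokens".encode("utf-8")
# Placeholder logits: favour a boundary after each space and at the last byte.
logits = [4.0 if ch == 0x20 or i == len(text) - 1 else -4.0
          for i, ch in enumerate(text)]
rng = random.Random(0)
boundaries = [hard_gumbel_sigmoid(l, temperature=1.0, rng=rng) for l in logits]
segments = segment_bytes(text, boundaries)
```

Because the decisions are sampled, the segmentation can differ between runs with different seeds; concatenating the segments always recovers the original bytes.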
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces FLEXITOKENS, a byte-level language model with a learnable boundary predictor for variable-length tokenization. It proposes a simplified training objective that removes the auxiliary fixed-compression loss common in prior tokenizer-free approaches, claiming this yields greater flexibility during adaptation to new distributions, languages, or domains. The authors report that the method reduces token over-fragmentation and delivers up to 10 percentage point gains on token classification and generative tasks relative to BPE and gradient-based tokenizer baselines, with consistent improvements across model scales and multilingual/morphologically diverse benchmarks. Code and data release is promised.
Significance. If the central empirical claims are substantiated, the work could meaningfully advance adaptive tokenization for evolving language models by addressing subword rigidity in OOD settings. The simplification of the objective to remove an auxiliary constraint is a clear conceptual contribution. Credit is due for the cross-scale validation and the commitment to public code and data release, which directly supports reproducibility in this area.
major comments (2)
- [Method] Method section (description of the simplified objective): the boundary predictor is trained solely on the main task loss without the auxiliary compression term. No analysis or ablation is provided to demonstrate that this does not lead to degenerate fixed policies (always-emit or never-emit boundaries), which would directly undermine the claimed reduction in over-fragmentation and downstream gains. This is load-bearing for the headline performance claims.
- [Experiments] Experiments section: the abstract states consistent improvements and up to 10-point gains, yet the reported results lack details on statistical significance testing, precise baseline re-implementations, data splits, or explicit ablations isolating the effect of removing the auxiliary loss. These omissions prevent full verification of the cross-scale and cross-task claims.
minor comments (1)
- [Abstract] The abstract would be clearer if it briefly noted the range of model sizes used for the scale-consistency validation.
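The degenerate policies named in the first major comment are straightforward to screen for once boundary decisions are logged. The sketch below is one possible diagnostic, not an analysis from the manuscript: it takes hypothetical 0/1 boundary sequences, computes each sequence's boundary rate k/N, and labels the predictor as always-emit, never-emit, constant-rate, or input-dependent. The function names, the labels, and the tolerance are illustrative assumptions.

```python
def boundary_rates(boundary_seqs):
    """Return k/N for each non-empty sequence of 0/1 boundary decisions."""
    return [sum(seq) / len(seq) for seq in boundary_seqs if seq]


def diagnose_policy(boundary_seqs, tol=0.02):
    """Label a set of boundary sequences; `tol` is an arbitrary tolerance."""
    rates = boundary_rates(boundary_seqs)
    if not rates:
        raise ValueError("no non-empty sequences to diagnose")
    if all(r >= 1.0 - tol for r in rates):
        return "always-emit"
    if all(r <= tol for r in rates):
        return "never-emit"
    if max(rates) - min(rates) <= tol:
        # Same rate on every input: not necessarily degenerate, but worth inspecting.
        return "constant-rate"
    return "input-dependent"


# Hand-made examples (not model outputs).
always = [[1, 1, 1, 1], [1, 1, 1]]
never = [[0, 0, 0, 0], [0, 0]]
constant = [[0, 1, 0, 1], [1, 0, 1, 0]]
varied = [[0, 0, 1, 0, 0, 1], [0, 1, 0, 1, 0, 1, 0, 1]]
labels = [diagnose_policy(x) for x in (always, never, constant, varied)]
```

A check like this only rules out the trivial cases; it says nothing about whether non-trivial boundaries are linguistically sensible or useful downstream.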
Simulated Author's Rebuttal
We thank the referee for their constructive feedback and for highlighting areas where additional evidence would strengthen the manuscript. We address each major comment below and commit to revisions that directly respond to the concerns raised.
- Referee: [Method] Method section (description of the simplified objective): the boundary predictor is trained solely on the main task loss without the auxiliary compression term. No analysis or ablation is provided to demonstrate that this does not lead to degenerate fixed policies (always-emit or never-emit boundaries), which would directly undermine the claimed reduction in over-fragmentation and downstream gains. This is load-bearing for the headline performance claims.
  Authors: We agree that an explicit demonstration that the simplified objective avoids degenerate boundary policies is important for substantiating the core claims. In the current manuscript we relied on the fact that the boundary predictor is optimized jointly with the language modeling loss, which should penalize policies that produce uninformative segmentations (as these would increase perplexity). However, this argument is indirect. To address the referee's point directly, we will add an ablation in the revised manuscript that compares the learned boundary distributions against forced always-emit and never-emit baselines, along with the resulting tokenization statistics and downstream performance. This analysis will be placed in the method or experiments section.
  Revision: yes
- Referee: [Experiments] Experiments section: the abstract states consistent improvements and up to 10-point gains, yet the reported results lack details on statistical significance testing, precise baseline re-implementations, data splits, or explicit ablations isolating the effect of removing the auxiliary loss. These omissions prevent full verification of the cross-scale and cross-task claims.
  Authors: We acknowledge that the current experimental reporting is insufficient for full verification. In the revision we will expand the experiments section to include: (i) statistical significance testing (multiple random seeds with reported means, standard deviations, and p-values where appropriate), (ii) precise descriptions of baseline re-implementations including all hyperparameters and training details, (iii) explicit documentation of data splits and preprocessing, and (iv) a dedicated ablation that isolates the contribution of removing the auxiliary compression loss. These additions will allow readers to reproduce and verify the cross-scale and cross-task results.
  Revision: yes
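Item (i) of the authors' second response can be sketched with the standard library alone. The example below aggregates per-seed scores into a mean and sample standard deviation, then runs an exact paired sign-flip permutation test on the per-seed differences. The scores are invented placeholders, not results from the paper, and the choice of test is an assumption; the response does not say which test will be used.

```python
import itertools
import statistics


def summarize(scores):
    """Mean and sample standard deviation across seeds."""
    return statistics.mean(scores), statistics.stdev(scores)


def paired_sign_flip_p(a, b):
    """Exact two-sided sign-flip permutation p-value for paired scores."""
    diffs = [x - y for x, y in zip(a, b)]
    observed = abs(sum(diffs))
    hits = total = 0
    # Enumerate every way of flipping the sign of each paired difference.
    for signs in itertools.product((1, -1), repeat=len(diffs)):
        total += 1
        if abs(sum(s * d for s, d in zip(signs, diffs))) >= observed - 1e-12:
            hits += 1
    return hits / total


# Placeholder accuracies for five seeds (NOT from the paper).
method_scores = [71.2, 70.8, 71.9, 70.5, 71.4]
baseline_scores = [68.9, 69.4, 69.0, 68.1, 69.7]
mean_m, std_m = summarize(method_scores)
mean_b, std_b = summarize(baseline_scores)
p_value = paired_sign_flip_p(method_scores, baseline_scores)
```

With five seeds there are only 2^5 = 32 sign assignments, so the smallest two-sided p-value this test can return is 2/32 = 0.0625. A design that aims for significance at the 0.05 level therefore needs more seeds or a different test.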
Circularity Check
No circularity: empirical proposal with independent experimental validation
The paper introduces FLEXITOKENS as a new simplified training objective for a byte-level boundary predictor, removing the auxiliary fixed-compression loss used in prior tokenizer-free methods. All performance claims (reduced over-fragmentation, up to 10-point gains on classification and generation tasks) are supported by direct empirical evaluation on multilingual benchmarks, morphologically diverse tasks, and domain shifts across model scales. No equations, derivations, or fitted parameters are presented as 'predictions' that reduce to the inputs by construction. No self-citations, uniqueness theorems, or ansatzes from prior author work are invoked as load-bearing justification. The central result is therefore an empirical training change whose validity rests on external benchmark outcomes rather than any self-referential reduction.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Byte sequences provide a complete and sufficient representation for learning token boundaries in any script or language.
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/AlphaCoordinateFixation.lean · alpha_pin_under_high_calibration
  Relation between the paper passage and the cited Recognition theorem: unclear
  Paper passage: We propose FLEXITOKENS, a simplified training objective... L_BP = max(k/N − α, 0) + max(β − k/N, 0)
- IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · absolute_floor_iff_bare_distinguishability
  Relation between the paper passage and the cited Recognition theorem: unclear
  Paper passage: boundary predictor... hard Gumbel sigmoid re-parameterization
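The loss in the first quoted passage above can be written out directly. The sketch below is a plain-Python reading of L_BP = max(k/N − α, 0) + max(β − k/N, 0). It assumes k is the number of predicted boundaries, N is the sequence length in bytes, and α and β are upper and lower bounds on the boundary rate. Those interpretations and the example values are assumptions, since the page quotes the formula without defining its symbols.

```python
def boundary_loss(k, n, alpha, beta):
    """Two-sided hinge on the boundary rate k/n.

    Zero whenever beta <= k/n <= alpha; grows linearly outside that band.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    if beta > alpha:
        raise ValueError("expected beta <= alpha")
    rate = k / n
    return max(rate - alpha, 0.0) + max(beta - rate, 0.0)


# Hypothetical bounds: between 1 boundary per 10 bytes and 1 per 4 bytes.
alpha, beta = 0.25, 0.10
inside = boundary_loss(k=20, n=100, alpha=alpha, beta=beta)    # rate 0.20, inside the band
too_many = boundary_loss(k=50, n=100, alpha=alpha, beta=beta)  # rate 0.50, above alpha
too_few = boundary_loss(k=5, n=100, alpha=alpha, beta=beta)    # rate 0.05, below beta
```

Under this reading the penalty is zero anywhere inside the band, so it constrains only extreme boundary rates rather than pinning a single target rate.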
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 2 Pith papers
- Scratchpad Patching: Decoupling Compute from Patch Size in Byte-Level Language Models
  Scratchpad Patching decouples compute from patch size in byte-level language models by inserting entropy-triggered scratchpads to update patch context dynamically.
- Compute Optimal Tokenization
  Compute-optimal language models require parameter count to scale with data bytes rather than tokens, with optimal token compression rate decreasing as compute budget grows.
Appendix A Limitations
Our limited computational budget prevents us from training larger models with more language on larger datasets. We anticipate the results will improve with scaling potentially providing even higher c...