Sampling from Your Language Model One Byte at a Time

arXiv: 2506.14123 · v3 · submitted 2025-06-17 · cs.CL · cs.FL · cs.LG

Jonathan Hayase, Alisa Liu, Noah A. Smith, Sewoong Oh

Pith reviewed 2026-05-19 09:49 UTC · model grok-4.3

keywords: prompt boundary problem · byte-level sampling · BPE tokenizer · inference-time conversion · language model ensembling · proxy-tuning · tokenization distortion · autoregressive sampling

The pith

A new inference-time method converts any BPE language model into a byte-level sampler that eliminates the prompt boundary problem.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a technique that turns any autoregressive language model using a BPE tokenizer into an equivalent character-level or byte-level model during sampling. This directly addresses the Prompt Boundary Problem, where token boundaries distort model outputs for prompts ending in spaces or for languages and code where tokens do not align with natural boundaries. The same conversion unifies vocabularies across models with different tokenizers. As a result, models can be ensembled at inference time or have post-training transferred between them through proxy-tuning. The approach is designed to run efficiently while keeping the original conditional distributions intact.
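To make the Prompt Boundary Problem concrete, here is a minimal sketch with a toy BPE tokenizer. The merge list `MERGES` and the helper `bpe_encode` are hypothetical illustrations, not taken from the paper or from any real tokenizer. The sketch shows that the tokens of a prompt ending mid-word need not be a prefix of the tokens of the completed word, so the model is conditioned on a token sequence it would rarely or never have seen followed by the natural continuation.

```python
# Toy BPE merges in priority order (hypothetical, for illustration only).
MERGES = [("t", "e"), ("te", "s"), ("tes", "t")]

def bpe_encode(text, merges):
    """Repeatedly apply the highest-priority applicable merge until none applies."""
    ranks = {pair: rank for rank, pair in enumerate(merges)}
    tokens = list(text)
    while True:
        # Collect (rank, position) for every adjacent pair that has a merge rule.
        candidates = [
            (ranks[pair], i)
            for i, pair in enumerate(zip(tokens, tokens[1:]))
            if pair in ranks
        ]
        if not candidates:
            return tokens
        _, i = min(candidates)
        tokens[i:i + 2] = [tokens[i] + tokens[i + 1]]

prompt_tokens = bpe_encode("tes", MERGES)   # prompt that stops mid-word
full_tokens = bpe_encode("test", MERGES)    # the text the user actually wants
is_prefix = full_tokens[:len(prompt_tokens)] == prompt_tokens
print(prompt_tokens, full_tokens, is_prefix)
```

In this toy setup the prompt encodes to the single token `tes` while the full word encodes to the single token `test`, so a model that continues from `tes` has to emit `t` as a separate token, a pairing the tokenizer itself would never produce.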

Core claim

We present an inference-time method to convert any autoregressive LM with a BPE tokenizer into a character-level or byte-level LM. Our method efficiently solves the PBP and is also able to unify the vocabularies of language models with different tokenizers, allowing one to ensemble LMs with different tokenizers at inference time or transfer the post-training from one model to another using proxy-tuning.

What carries the argument

The byte-level sampling conversion, which reparameterizes token probabilities so that the model can be sampled one byte at a time while exactly matching the original autoregressive distributions over text.
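One way to write down what "exactly matching" would mean, in notation that is ours rather than the paper's: let $p$ be the token-level model, let $x$ be a byte string, and let $\mathcal{C}(x)$ be the set of valid token sequences $t_{1:k}$ whose decoded bytes begin with $x$ and whose final token $t_k$ starts before $x$ ends. Assuming the model places its mass on valid tokenizations, the byte-level prefix probability and the next-byte conditional would be:

```latex
P(x) = \sum_{t_{1:k} \in \mathcal{C}(x)} \prod_{i=1}^{k} p(t_i \mid t_{<i}),
\qquad
P(b \mid x) = \frac{P(x b)}{P(x)}
```

Under these assumptions, sampling one byte at a time from $P(b \mid x)$ reproduces the model's distribution over text by the chain rule.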

If this is right

The prompt boundary problem is solved for any prompt, including those ending in spaces and for code or Chinese text.
Language models with incompatible tokenizers can be combined into ensembles at inference time.
Post-training performed on one model can be transferred to another via proxy-tuning without retraining the target.
The original conditional probability distributions over sequences remain unchanged by the conversion.
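Once two models expose next-byte distributions over the same byte alphabet, combining them reduces to arithmetic on small probability tables. The sketch below is an illustration under stated assumptions: the function names `ensemble` and `proxy_tune` are hypothetical, the tables are made-up numbers, and the proxy-tuning rule shown (adding the difference between a tuned and an untuned small model's log-probabilities to a large base model) is applied here to byte-level log-probabilities. Whether the paper combines models in exactly this way is not confirmed by the text above.

```python
import math

def ensemble(dists, weights=None):
    """Weighted average of next-byte probability tables (byte value -> probability)."""
    weights = weights or [1.0 / len(dists)] * len(dists)
    mixed = {}
    for dist, w in zip(dists, weights):
        for byte, p in dist.items():
            mixed[byte] = mixed.get(byte, 0.0) + w * p
    return mixed

def proxy_tune(base_large, tuned_small, base_small):
    """Shift the large model's log-probs by (tuned - untuned) small-model log-probs.

    Assumes all three tables share the same keys and have strictly positive entries.
    """
    scores = {
        b: math.log(base_large[b]) + math.log(tuned_small[b]) - math.log(base_small[b])
        for b in base_large
    }
    log_z = math.log(sum(math.exp(s) for s in scores.values()))
    return {b: math.exp(s - log_z) for b, s in scores.items()}

# Hypothetical next-byte tables over two bytes, 'a' (97) and 'b' (98).
model_a = {97: 0.5, 98: 0.5}
model_b = {97: 0.9, 98: 0.1}
mixed = ensemble([model_a, model_b])
steered = proxy_tune(
    base_large={97: 0.5, 98: 0.5},
    tuned_small={97: 0.8, 98: 0.2},
    base_small={97: 0.5, 98: 0.5},
)
print(mixed, steered)
```

The point of the design is that neither function needs to know anything about the tokenizers involved: the byte alphabet is the shared vocabulary.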

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The technique may improve output consistency in multilingual and programming contexts where token boundaries rarely match linguistic or syntactic units.
Similar reparameterization ideas could be tested on tokenizers other than BPE to broaden applicability.
Targeted experiments on edge-case prompts in Chinese or code would directly measure whether tokenization distortions disappear.

Load-bearing premise

Byte-level sampling can be performed at inference time while preserving the original model's exact conditional distributions over byte sequences and without prohibitive slowdown.

What would settle it

A side-by-side comparison in which the probability of a given byte sequence differs between the original model and the converted byte sampler on the same prompt, or where inference latency increases substantially.

Figures

Figures reproduced from arXiv: 2506.14123 by Alisa Liu, Jonathan Hayase, Noah A. Smith, Sewoong Oh.

**Figure 1.** ByteSampler resolves the prompt boundary problem (exhibited in the output of generate()). In this example, test, 都是, and .getElementById are all single tokens in the respective tokenizers. The Prompt Boundary Problem (PBP). In particular, Eq. (1) introduces distortion whenever the prompt ends on a prefix of what could otherwise be a single token. More concretely, consider LLAMA3.2-1B and suppose the use…

**Figure 2.** Construction of the Valid Covering Tree for string prefix “hypot”: (a) starting with the infinite tree of all possible token sequences (many edges not shown), we prune branches that (b) do not match the given prefix or begin after the prefix ends or (c) contain invalid contiguous pairs of tokens. More example trees are shown in Appendix E. The tree depicted in Fig. 2b corresponds to the cover described in …

**Figure 3.** Example of valid and invalid token pairs. We show the initial string’s bytes and the merges

**Figure 4.** Example Valid Cover Tree for prefix “this is a tes” with the OLM

**Figure 6.** Example Valid Cover Tree for prefix “BPE Tokenizatio” with the OLM

**Figure 7.** Example Valid Cover Tree for prefix “inductive hypothe” with the OLM
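The pruning steps in the Figure 2 caption can be sketched by brute force on a toy vocabulary. Everything here is a hypothetical illustration: `MERGES`, `VOCAB`, `bpe_encode`, and `covers` are not the paper's code, and the validity test (re-encode the decoded string and compare) stands in for the paper's pairwise token validation, which is described as more efficient. A non-empty prefix is assumed.

```python
# Toy BPE merges and vocabulary (hypothetical, for illustration only).
MERGES = [("t", "e"), ("te", "s"), ("tes", "t")]
VOCAB = ["t", "e", "s", "te", "tes", "test"]

def bpe_encode(text, merges):
    """Repeatedly apply the highest-priority applicable merge until none applies."""
    ranks = {pair: rank for rank, pair in enumerate(merges)}
    tokens = list(text)
    while True:
        candidates = [
            (ranks[pair], i)
            for i, pair in enumerate(zip(tokens, tokens[1:]))
            if pair in ranks
        ]
        if not candidates:
            return tokens
        _, i = min(candidates)
        tokens[i:i + 2] = [tokens[i] + tokens[i + 1]]

def covers(prefix, vocab, merges):
    """Enumerate valid token sequences that cover a non-empty prefix.

    A sequence covers the prefix if its decoded text starts with the prefix
    and its last token begins before the prefix ends.
    """
    results = []

    def extend(seq, consumed):
        rest = prefix[consumed:]
        for tok in vocab:
            if tok.startswith(rest):
                # Token reaches or passes the end of the prefix: candidate leaf.
                cand = seq + [tok]
                # Brute-force validity check: the tokenizer must reproduce it.
                if bpe_encode("".join(cand), merges) == cand:
                    results.append(cand)
            elif rest.startswith(tok):
                # Token lies strictly inside the prefix: keep descending.
                extend(seq + [tok], consumed + len(tok))
            # Otherwise the token does not match the prefix and is pruned.

    extend([], 0)
    return results

tree_leaves = covers("tes", VOCAB, MERGES)
print(tree_leaves)
```

For the prefix "tes" in this toy setup, the sequences `t e s` and `te s` match the prefix but are rejected as invalid tokenizations, leaving only the covers `tes` and `test`.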

Original abstract

Tokenization is used almost universally by modern language models, enabling efficient text representation using multi-byte or multi-character tokens. However, prior work has shown that tokenization can introduce distortion into the model's generations, an issue known as the Prompt Boundary Problem (PBP). For example, users are often advised not to end their prompts with a space because it prevents the model from including the space as part of the next token. While this heuristic is effective in English, the underlying PBP continues to affect code generation and languages such as Chinese, where tokens often do not line up with word and syntactic boundaries. In this work, we present an inference-time method to convert any autoregressive LM with a BPE tokenizer into a character-level or byte-level LM. Our method efficiently solves the PBP and is also able to unify the vocabularies of language models with different tokenizers, allowing one to ensemble LMs with different tokenizers at inference time or transfer the post-training from one model to another using proxy-tuning. Code is available at https://github.com/SewoongLab/byte-sampler.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper shows how to turn any BPE LM into a byte-level sampler at inference time to fix prompt boundary issues and mix models with different tokenizers.

The main point is an inference-time conversion that lets you sample byte by byte from a standard token-based model. This removes the prompt boundary problem that distorts output in code and languages like Chinese, and it also unifies vocabularies so you can ensemble or proxy-tune across tokenizers without retraining anything. They release the code, which is useful for testing the claims directly.

The approach builds on known tokenization problems but applies the byte-level marginalization in a new way for cross-model work. The examples make clear why the heuristic of avoiding trailing spaces falls short outside English. If the conversion really keeps the original conditional distributions intact, it gives practitioners a clean way to handle generation quality without model changes.

The soft spot is the marginalization step itself. To preserve the exact distribution you have to sum over every matching token prefix and keep proper state for partial tokens at each step. The abstract says the method is efficient, but any approximation or state error in the trie or DP would make the byte probabilities diverge from the original model. That would undermine both the PBP fix and the ensembling guarantee. I would want to see runtime measurements and a direct check that the byte-level probabilities match the summed token ones.

The paper is aimed at people who run inference on code models or multilingual setups and need to combine off-the-shelf checkpoints. A reader working on generation tricks or deployment would find the technique worth trying. It deserves a serious referee because the problem is real and the method is presented as a practical algorithmic change rather than a loose heuristic.

Referee Report

1 major / 0 minor

Summary. The paper claims to present an inference-time method that converts any autoregressive language model using a BPE tokenizer into a byte-level or character-level model. This approach is said to efficiently solve the Prompt Boundary Problem (PBP) and to unify vocabularies of models with different tokenizers, enabling ensembling at inference time or proxy-tuning for post-training transfer.

Significance. If the byte-level sampling exactly preserves the original model's conditional distributions, the work would provide a practical fix for PBP that impacts code generation and languages like Chinese, where token boundaries do not align with syntactic units. It would also facilitate cross-tokenizer model combinations without retraining. The open availability of code is a strength for verification and extension.

major comments (1)

The central technical claim requires that the byte-level sampler computes the exact marginal probability P(next byte | history) by summing over all tokens whose byte prefix matches the current partial token state. The skeptic's concern is that any mismatch in this marginalization or failure to track ongoing tokenizations would cause the generated distribution to diverge from the original model. Please provide the derivation or pseudocode (e.g., in the algorithm description) showing how the trie or DP maintains exact equivalence classes without approximation.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their careful review and for recognizing the potential impact of our inference-time byte-level sampling approach on the Prompt Boundary Problem and cross-tokenizer unification. We address the major technical comment below and have revised the manuscript to provide the requested clarification.

Referee: The central technical claim requires that the byte-level sampler computes the exact marginal probability P(next byte | history) by summing over all tokens whose byte prefix matches the current partial token state. The skeptic's concern is that any mismatch in this marginalization or failure to track ongoing tokenizations would cause the generated distribution to diverge from the original model. Please provide the derivation or pseudocode (e.g., in the algorithm description) showing how the trie or DP maintains exact equivalence classes without approximation.

Authors: We appreciate the referee's emphasis on verifying exact equivalence. Our byte sampler maintains a trie over the BPE vocabulary and uses dynamic programming to track the set of all active token prefixes consistent with the observed byte history. At each step the probability of the next byte b is computed exactly as the sum of the model's token probabilities for every vocabulary token whose byte string begins with the current prefix plus b, divided by the total probability mass of all tokens consistent with the current prefix. This marginalization is performed without approximation or sampling and preserves the original conditional distribution over byte sequences by construction. We have added a formal derivation in Section 3.2 together with explicit pseudocode (Algorithm 1) in the revised manuscript that illustrates the equivalence-class maintenance.

Revision: yes
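The rule stated in this simulated response can be sketched for a single token step. This is an illustration only: `next_byte_distribution` and the probability table are hypothetical, and the sketch deliberately covers just the case described in the response, where the pending bytes lie strictly inside one token. It leaves out the case where a token ends exactly at the current prefix and a new token begins, which a complete method would also need to account for.

```python
def next_byte_distribution(token_probs, partial):
    """Marginal next-byte distribution given the bytes of a partially emitted token.

    token_probs maps each token's byte string to its model probability at this step.
    Only tokens that extend strictly beyond `partial` contribute a next byte.
    """
    consistent = {
        tok: p
        for tok, p in token_probs.items()
        if tok.startswith(partial) and len(tok) > len(partial)
    }
    total = sum(consistent.values())
    dist = {}
    for tok, p in consistent.items():
        next_byte = tok[len(partial)]  # indexing bytes yields an int in 0..255
        dist[next_byte] = dist.get(next_byte, 0.0) + p / total
    return dist

# Hypothetical token probabilities at one decoding step.
token_probs = {b"test": 0.5, b"tea": 0.2, b"to": 0.2, b"a": 0.1}
dist = next_byte_distribution(token_probs, b"te")
print({chr(b): round(p, 4) for b, p in dist.items()})
```

With the pending bytes `te`, only `test` and `tea` remain consistent, so the next byte is `s` with probability 0.5 / 0.7 and `a` with probability 0.2 / 0.7.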

Circularity Check

0 steps flagged

No circularity: new algorithmic conversion technique with no self-referential derivations or fitted predictions

The paper presents an inference-time algorithmic method to convert BPE-tokenized LMs to byte-level sampling, solving the Prompt Boundary Problem and enabling vocabulary unification for ensembling or proxy-tuning. The abstract and described claims introduce a new technique without equations, parameter fits, or derivations that reduce to inputs by construction. No self-citation load-bearing steps, uniqueness theorems, or ansatzes are invoked in the provided text to justify core results. The central claim rests on the correctness of the marginalization implementation rather than any definitional equivalence or renamed empirical pattern. This is a standard non-finding for a methods paper whose contribution is procedural rather than deductive.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The approach rests on standard assumptions about autoregressive factorization and the ability to re-express token probabilities over bytes; no new free parameters or invented entities are introduced in the abstract.

axioms (1)

domain assumption: Autoregressive language models factorize probability over tokens that can be re-expressed at the byte level.
Implicit in the claim that any BPE model can be converted to byte-level sampling.

pith-pipeline@v0.9.0 · 5728 in / 1102 out tokens · 26648 ms · 2026-05-19T09:49:27.508854+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · reality_from_one_distinction · unclear
Paper passage: "We introduce an efficient procedure to condition a BPE tokenizer-based model on an arbitrary byte-prefix... using the Valid Covering Tree... pairwise validation... Proposition 3.1"

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Paper passage: "The tree represents exactly the set of valid sequences of tokens with the prompt as a prefix... bounded depth... constant time updates"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

96 extracted references · 96 canonical work pages · 3 internal anchors

[1]

O. Ahia, S. Kumar, H. Gonen, V . Hofmann, T. Limisiewicz, Y . Tsvetkov, and N. A. Smith. MAGNET: Improving the multilingual fairness of language models with adaptive gradient- based tokenization. In The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024. URL https://openreview.net/forum?id=1e3MOwHSIX

work page 2024
[2]

Athiwaratkun, S

B. Athiwaratkun, S. Wang, M. Shang, Y . Tian, Z. Wang, S. K. Gonugondla, S. K. Gouda, R. Kwiatkowski, R. Nallapati, P. Bhatia, and B. Xiang. Token alignment via character matching for subword completion. In L.-W. Ku, A. Martins, and V . Srikumar, editors,Findings of the Association for Computational Linguistics: ACL 2024, pages 15725–15738, Bangkok, Thail...

work page doi:10.18653/v1/2024.findings-acl.929 2024
[3]

Berglund and B

M. Berglund and B. van der Merwe. Formalizing bpe tokenization. In 13th International Work- shop on Non-Classical Models of Automata and Applications, NCMA 2023, 18-19 September, 2023, Famagusta, Cyprus, pages 16–27. Open Publishing Association, 2023

work page 2023
[4]

Beyond the imitation game: Quantifying and extrapolating the capabilities of language models

BIG-bench. Beyond the imitation game: Quantifying and extrapolating the capabilities of language models. Transactions on Machine Learning Research, 2023. ISSN 2835-8856. URL https://openreview.net/forum?id=uyTL5Bvosj

work page 2023
[5]

T. B. Brown, B. Mann, N. Ryder, M. Subbiah, J. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, S. Agarwal, A. Herbert-V oss, G. Krueger, T. Henighan, R. Child, A. Ramesh, D. M. Ziegler, J. Wu, C. Winter, C. Hesse, M. Chen, E. Sigler, M. Litwin, S. Gray, B. Chess, J. Clark, C. Berner, S. McCandlish, A. Radford, I. Sutskever, and D. Amo...

work page 1901
[6]

Cao and L

K. Cao and L. Rimell. You should evaluate your language model on marginal likelihood over tokenisations. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 2104–2114, 2021. 10

work page 2021
[7]

Y . Chen, K. Marchisio, R. Raileanu, D. Adelani, P. Stenetorp, S. Riedel, and M. Artetxe. Improving language plasticity via pretraining with active forgetting. In Advances in Neural Information Processing Systems. NeurIPS, 2023

work page 2023
[8]

Z. Chen, J. Li, P. Chen, Z. Li, K. Sun, Y . Luo, Q. Mao, D. Yang, H. Sun, and P. S. Yu. Harnessing multiple large language models: A survey on llm ensemble. arXiv preprint arXiv:2502.18036, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[9]

Chirkova, G

N. Chirkova, G. Kruszewski, J. Rozen, and M. Dymetman. Should you marginalize over possible tokenizations? In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 1–12, 2023

work page 2023
[10]

Chizhov, C

P. Chizhov, C. Arnett, E. Korotkova, and I. P. Yamshchikov. Bpe gets picky: Efficient vocabulary refinement during tokenizer training. arXiv preprint arXiv:2409.04599, 2024

work page arXiv 2024
[11]

Chuang, Y

Y .-S. Chuang, Y . Xie, H. Luo, Y . Kim, J. R. Glass, and P. He. Dola: Decoding by contrasting layers improves factuality in large language models. In The Twelfth International Conference on Learning Representations, 2024

work page 2024
[12]

J. H. Clark, D. Garrette, I. Turc, and J. Wieting. Canine: Pre-training an efficient tokenization- free encoder for language representation. Transactions of the Association for Computational Linguistics, 10:73–91, 2022. doi: 10.1162/tacl_a_00448. URL https://aclanthology. org/2022.tacl-1.5

work page doi:10.1162/tacl_a_00448 2022
[13]

Dagan, G

G. Dagan, G. Synnaeve, and B. Rozière. Getting the most out of your tokenizer for pre-training and domain adaptation. In Proceedings of the 41st International Conference on Machine Learning, ICML’24. JMLR.org, 2024

work page 2024
[14]

Dobler and G

K. Dobler and G. De Melo. Focus: Effective embedding initialization for monolingual special- ization of multilingual models. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 13440–13454, 2023

work page 2023
[15]

D. Dua, Y . Wang, P. Dasigi, G. Stanovsky, S. Singh, and M. Gardner. DROP: A reading comprehension benchmark requiring discrete reasoning over paragraphs. In J. Burstein, C. Doran, and T. Solorio, editors, Proceedings of the 2019 Conference of the North Amer- ican Chapter of the Association for Computational Linguistics: Human Language Tech- nologies, Vol...

work page doi:10.18653/v1/n19-1246 2019
[16]

P. Gage. A new algorithm for data compression. The C Users Journal archive, 12:23–38, 1994. URLhttps://api.semanticscholar.org/CorpusID:59804030

work page 1994
[17]

L. Gee, A. Zugarini, L. Rigutini, P. Torroni, et al. Fast vocabulary transfer for language model compression. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing: Industry Track, pages 409–416. Association for Computational Linguistics (ACL), 2022

work page 2022
[18]

L. Gee, L. Rigutini, M. Ernandes, and A. Zugarini. Multi-word tokenization for sequence compression. In M. Wang and I. Zitouni, editors, Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing: Industry Track, pages 612–621, Singapore, Dec. 2023. Association for Computational Linguistics. doi: 10.18653/v1/2023.emnlp-industry

work page doi:10.18653/v1/2023.emnlp-industry 2023
[19]

URL https://aclanthology.org/2023.emnlp-industry.58

work page 2023
[20]

R. L. Geh, H. Zhang, K. Ahmed, B. Wang, and G. V . d. Broeck. Where is the signal in tokenization space? arXiv preprint arXiv:2408.08541, 2024

work page arXiv 2024
[21]

A. Gera, R. Friedman, O. Arviv, C. Gunasekara, B. Sznajder, N. Slonim, and E. Shnarch. The benefits of bad advice: Autocontrastive decoding across model layers. In A. Rogers, J. Boyd- Graber, and N. Okazaki, editors, Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 10406–10420, Toronto,...

work page doi:10.18653/v1/2023.acl-long.580 2023
[22]

Holtzman, J

A. Holtzman, J. Buys, L. Du, M. Forbes, and Y . Choi. The curious case of neural text degeneration. In International Conference on Learning Representations, 2020

work page 2020
[23]

Huang, X

Y . Huang, X. Feng, B. Li, Y . Xiang, H. Wang, T. Liu, and B. Qin. Ensemble learning for heterogeneous large language models with deep parallel collaboration. Advances in Neural Information Processing Systems, 37:119838–119860, 2024

work page 2024
[24]

J. Jackson. Character prefix conditioning, 2025. URL https://www.cursor.com/blog/cpc

work page 2025
[25]

TriviaQA: A large scale distantly supervised challenge dataset for reading comprehension

M. Joshi, E. Choi, D. Weld, and L. Zettlemoyer. TriviaQA: A large scale distantly supervised challenge dataset for reading comprehension. In R. Barzilay and M.-Y . Kan, editors,Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1601–1611, Vancouver, Canada, July 2017. Association for Comp...

work page doi:10.18653/v1/p17-1147 2017
[26]

Kasai, K

J. Kasai, K. Sakaguchi, R. Le Bras, H. Peng, X. Lu, D. Radev, Y . Choi, and N. A. Smith. Twist decoding: Diverse generators guide each other. In Y . Goldberg, Z. Kozareva, and Y . Zhang, editors, Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 4909–4923, Abu Dhabi, United Arab Emirates, Dec. 2022. Association ...

work page doi:10.18653/v1/2022.emnlp-main.326 2022
[27]

T. Kudo. Subword regularization: Improving neural network translation models with multi- ple subword candidates. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 66–75, 2018

work page 2018
[28]

Kudo and J

T. Kudo and J. Richardson. Sentencepiece: A simple and language independent subword tokenizer and detokenizer for neural text processing. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 66–71, 2018

work page 2018
[29]

Kudugunta, I

S. Kudugunta, I. Caswell, B. Zhang, X. Garcia, D. Xin, A. Kusupati, R. Stella, A. Bapna, and O. Firat. Madlad-400: A multilingual and document-level large audited dataset. Advances in Neural Information Processing Systems, 36:67284–67296, 2023

work page 2023
[30]

Kumar and A

D. Kumar and A. Thawani. BPE beyond word boundary: How NOT to use multi word expres- sions in neural machine translation. In S. Tafreshi, J. Sedoc, A. Rogers, A. Drozd, A. Rumshisky, and A. Akula, editors, Proceedings of the Third Workshop on Insights from Negative Results in NLP, pages 172–179, Dublin, Ireland, May 2022. Association for Computational Lin...

work page doi:10.18653/v1/2022.insights-1.24 2022
[31]

Lambert, J

N. Lambert, J. Morrison, V . Pyatkin, S. Huang, H. Ivison, F. Brahman, L. J. V . Miranda, A. Liu, N. Dziri, S. Lyu, Y . Gu, S. Malik, V . Graf, J. D. Hwang, J. Yang, R. L. Bras, O. Tafjord, C. Wilhelm, L. Soldaini, N. A. Smith, Y . Wang, P. Dasigi, and H. Hajishirzi. Tulu 3: Pushing frontiers in open language model post-training, 2025. URL https://arxiv.o...

work page 2025
[32]

X. L. Li, A. Holtzman, D. Fried, P. Liang, J. Eisner, T. Hashimoto, L. Zettlemoyer, and M. Lewis. Contrastive decoding: Open-ended text generation as optimization. In A. Rogers, J. Boyd- Graber, and N. Okazaki, editors, Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 12286–12312, Toron...

work page doi:10.18653/v1/2023.acl-long.687 2023
[33]

A. Liu, M. Sap, X. Lu, S. Swayamdipta, C. Bhagavatula, N. A. Smith, and Y . Choi. DEx- perts: Decoding-time controlled text generation with experts and anti-experts. In C. Zong, F. Xia, W. Li, and R. Navigli, editors, Proceedings of the 59th Annual Meeting of the As- sociation for Computational Linguistics and the 11th International Joint Conference on Na...

work page
[34]

doi: 10.18653/v1/2021.acl-long.522

Association for Computational Linguistics. doi: 10.18653/v1/2021.acl-long.522. URL https://aclanthology.org/2021.acl-long.522. 12

work page doi:10.18653/v1/2021.acl-long.522 2021
[35]

A. Liu, X. Han, Y . Wang, Y . Tsvetkov, Y . Choi, and N. A. Smith. Tuning language models by proxy. In First Conference on Language Modeling, 2024

work page 2024
[36]

A. Liu, J. Hayase, V . Hofmann, S. Oh, N. A. Smith, and Y . Choi. SuperBPE: Space travel for language models. arXiv preprint arXiv:2503.13423, 2025. URL https://arxiv.org/abs/ 2503.13423

work page arXiv 2025
[37]

C. Liu, X. Quan, Y . Pan, L. Lin, W. Wu, and X. Chen. Cool-fusion: Fuse large language models without training. arXiv preprint arXiv:2407.19807, 2024

work page arXiv 2024
[38]

Y . Liu, P. Lin, M. Wang, and H. Schütze. Ofa: A framework of initializing unseen subword embeddings for efficient large-scale multilingual continued pretraining. In Findings of the Association for Computational Linguistics: NAACL 2024, pages 1067–1097, 2024

work page 2024
[39]

Lundberg

S. Lundberg. The art of prompt design: Prompt boundaries and token heal- ing, 2023. URL https://medium.com/towards-data-science/the-art-of-prompt- design-prompt-boundaries-and-token-healing-3b2448b0be38

work page 2023
[40]

B. Lv, C. Tang, Y . Zhang, X. Liu, Y . Yu, and P. Luo. Specfuse: Ensembling large language models via next-segment prediction. arXiv preprint arXiv:2412.07380, 2024

work page arXiv 2024
[41]

Marchisio, P

K. Marchisio, P. Lewis, Y . Chen, and M. Artetxe. Mini-model adaptation: Efficiently extending pretrained models to new languages via aligned shallow training. In The 61st Annual Meeting Of The Association For Computational Linguistics, 2023

work page 2023
[42]

Mavromatis, P

C. Mavromatis, P. Karypis, and G. Karypis. Pack of llms: Model fusion at test-time via perplexity optimization. In First Conference on Language Modeling, 2024

work page 2024
[43]

S. J. Mielke. Can you compare perplexity across different segmentations?, Apr 2019. URL https://sjmielke.com/comparing-perplexities.htm

work page 2019
[44]

Minixhofer, F

B. Minixhofer, F. Paischer, and N. Rekabsaz. Wechsel: Effective initialization of subword embeddings for cross-lingual transfer of monolingual language models. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 3992–4006, 2022

work page 2022
[45]

Minixhofer, E

B. Minixhofer, E. Ponti, and I. Vuli´c. Zero-shot tokenizer transfer. In The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024

work page 2024
[46]

Minixhofer, I

B. Minixhofer, I. Vuli´c, and E. M. Ponti. Universal cross-tokenizer distillation via approximate likelihood matching. arXiv preprint arXiv:2503.20083, 2025

work page arXiv 2025
[47]

Nawrot, J

P. Nawrot, J. Chorowski, A. Lancucki, and E. M. Ponti. Efficient transformers with dynamic token pooling. In A. Rogers, J. Boyd-Graber, and N. Okazaki, editors, Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages 6403–6417, Toronto, Canada, July 2023. Association for Computational Linguis...

work page doi:10.18653/v1/2023.acl-long.353 2023
[48]

Oh and W

B.-D. Oh and W. Schuler. Leading whitespaces of language models’ subword vocabulary pose a confound for calculating word probabilities. In Y . Al-Onaizan, M. Bansal, and Y .-N. Chen, editors, Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 3464–3472, Miami, Florida, USA, Nov. 2024. Association for Computationa...

work page doi:10.18653/v1/2024.emnlp-main.202 2024
[49]

T. OLMo, P. Walsh, L. Soldaini, D. Groeneveld, K. Lo, S. Arora, A. Bhagia, Y . Gu, S. Huang, M. Jordan, N. Lambert, D. Schwenk, O. Tafjord, T. Anderson, D. Atkinson, F. Brahman, C. Clark, P. Dasigi, N. Dziri, M. Guerquin, H. Ivison, P. W. Koh, J. Liu, S. Malik, W. Merrill, L. J. V . Miranda, J. Morrison, T. Murray, C. Nam, V . Pyatkin, A. Rangapur, M. Sch...

work page internal anchor Pith review Pith/arXiv arXiv 2024
[50]

Openai platform documentation, 2023

OpenAI. Openai platform documentation, 2023. URL https://platform.openai.com/ docs. Accessed: 2025/05/10. 13

work page 2023
[51]

Pagnoni, R

A. Pagnoni, R. Pasunuru, P. Rodriguez, J. Nguyen, B. Muller, M. Li, C. Zhou, L. Yu, J. Weston, L. Zettlemoyer, G. Ghosh, M. Lewis, A. Holtzman, and S. Iyer. Byte latent transformer: Patches scale better than tokens, 2024. URL https://arxiv.org/abs/2412.09871

work page arXiv 2024
[52]

Paperno, G

D. Paperno, G. Kruszewski, A. Lazaridou, N. Q. Pham, R. Bernardi, S. Pezzelle, M. Baroni, G. Boleda, and R. Fernández. The LAMBADA dataset: Word prediction requiring a broad discourse context. In K. Erk and N. A. Smith, editors, Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1525–1534...

work page doi:10.18653/v1/p16- 2016
[53]

URL https://aclanthology.org/P16-1144

work page
[54]

B. Phan, M. Havasi, M. Muckley, and K. Ullrich. Understanding and mitigating tokenization bias in language models, 2024. URL https://arxiv.org/abs/2406.16829

work page arXiv 2024
[55]

Pimentel and C

T. Pimentel and C. Meister. How to compute the probability of a word. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 18358–18375, 2024

work page 2024
[56]

Provilkov, D

I. Provilkov, D. Emelianenko, and E. V oita. Bpe-dropout: Simple and effective subword regularization. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, 2020

work page 2020
[57]

Rajpurkar, J

P. Rajpurkar, J. Zhang, K. Lopyrev, and P. Liang. SQuAD: 100,000+ questions for machine comprehension of text. In J. Su, K. Duh, and X. Carreras, editors, Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 2383–2392, Austin, Texas, Nov. 2016. Association for Computational Linguistics. doi: 10.18653/v1/D16-1264. U...

work page doi:10.18653/v1/d16-1264 2016
[58]

M. T. Ribeiro. A guidance language for controlling large language models, 2023. URL https: //github.com/guidance-ai/guidance?tab=readme-ov-file#text-not-tokens

work page 2023
[59]

R. S. 4d masks support in transformers, 2024. URL https://huggingface.co/blog/ poedator/4d-masks

work page 2024
[60]

Schuster and K

M. Schuster and K. Nakajima. Japanese and korean voice search. In 2012 IEEE international conference on acoustics, speech and signal processing (ICASSP) , pages 5149–5152. IEEE, 2012

work page 2012
[61]

Sennrich, B

R. Sennrich, B. Haddow, and A. Birch. Neural machine translation of rare words with subword units. In K. Erk and N. A. Smith, editors, Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1715–1725, Berlin, Germany, Aug. 2016. Association for Computational Linguistics. doi: 10.18653/v1/P16-...

[62]

R. Shi, Y. Chen, Y. Hu, A. Liu, H. Hajishirzi, N. A. Smith, and S. S. Du. Decoding-time language model alignment with multiple objectives. In The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024

[63]

W. Shi, X. Han, M. Lewis, Y. Tsvetkov, L. Zettlemoyer, and W.-t. Yih. Trusting your evidence: Hallucinate less with context-aware decoding. In K. Duh, H. Gomez, and S. Bethard, editors, Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 2: Short Papers), p...

[64]

Y. Tay, V. Q. Tran, S. Ruder, J. Gupta, H. W. Chung, D. Bahri, Z. Qin, S. Baumgartner, C. Yu, and D. Metzler. Charformer: Fast character transformers via gradient-based subword tokenization. In International Conference on Learning Representations, 2022. URL https://openreview.net/forum?id=JtBRnrlOEFN

[65]

L. Team. The Llama 3 herd of models, 2024. URL https://arxiv.org/abs/2407.21783

[66]

L. Team. Introducing Llama 3.1: Our most capable models to date, 2024. URL https://ai.meta.com/blog/meta-llama-3-1/

[67]

L. Team. Llama 3.2: Revolutionizing edge AI and vision with open, customizable models, 2024. URL https://ai.meta.com/blog/llama-3-2-connect-2024-vision-edge-mobile-devices/. Accessed: 2025/05/10

[68]

O. Team. OLMo release notes, 2025. URL https://allenai.org/olmo/release-notes#olmo-2-1b. Accessed: 2025/05/10

[69]

Q. Team. Qwen3: Think deeper, act faster, 2025. URL https://qwenlm.github.io/blog/qwen3/. Accessed: 2025/05/10

[70]

K. Tran. From English to foreign languages: Transferring pre-trained language models. arXiv preprint arXiv:2002.07306, 2020

[71]

B. Tunguz. 200,000+ Jeopardy! questions, 2019. URL https://www.kaggle.com/datasets/tunguz/200000-jeopardy-questions

[72]

B. Tunguz. 200,000+ Jeopardy! questions, 2019. URL https://www.kaggle.com/datasets/tunguz/200000-jeopardy-questions

[73]

A. Turaga. Character prefix conditioning with back tokenization, 2025. URL https://anilturaga.github.io/cpc

[74]

H. van Antwerpen and A. Neubeck. So many tokens, so little time: Introducing a faster, more flexible byte-pair tokenizer, 2025. URL https://github.blog/ai-and-ml/llms/so-many-tokens-so-little-time-introducing-a-faster-more-flexible-byte-pair-tokenizer/. Accessed: 2025/05/10

[75]

T. Vieira, B. LeBrun, M. Giulianelli, J. L. Gastaldi, B. DuSell, J. Terilla, T. J. O’Donnell, and R. Cotterell. From language models over tokens to language models over characters. arXiv preprint arXiv:2412.03719, 2024

[76]

T. Vieira, T. Liu, C. Pasti, Y. Emara, B. DuSell, B. LeBrun, M. Giulianelli, J. L. Gastaldi, T. J. O’Donnell, and R. Cotterell. Language models over canonical byte-pair encodings. arXiv preprint arXiv:2506.07956, 2025

[77]

J. Wang, T. Gangavarapu, J. N. Yan, and A. M. Rush. MambaByte: Token-free selective state space model. In First Conference on Language Modeling, 2024. URL https://openreview.net/forum?id=X1xNsuKssb

[78]

Y. Xu, J. Chen, J. Wu, and J. Zhang. Hit the sweet spot! Span-level ensemble for large language models. arXiv preprint arXiv:2409.18583, 2024

[79]

Y. Xu, J. Lu, and J. Zhang. Bridging the gap between different vocabularies for LLM ensemble. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pages 7133–7145, 2024

[80]

Y. Xu, J. Chen, J. Wu, and J. Zhang. Hit the sweet spot! Span-level ensemble for large language models. In Proceedings of the 31st International Conference on Computational Linguistics, pages 8314–8325, 2025

