
arXiv: 2506.14123 · v3 · submitted 2025-06-17 · 💻 cs.CL · cs.FL · cs.LG

Sampling from Your Language Model One Byte at a Time

Pith reviewed 2026-05-19 09:49 UTC · model grok-4.3

classification 💻 cs.CL · cs.FL · cs.LG
keywords: prompt boundary problem, byte-level sampling, BPE tokenizer, inference-time conversion, language model ensembling, proxy-tuning, tokenization distortion, autoregressive sampling

The pith

A new inference-time method converts any BPE language model into a byte-level sampler that eliminates the prompt boundary problem.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a technique that turns any autoregressive language model using a BPE tokenizer into an equivalent character-level or byte-level model during sampling. This directly addresses the Prompt Boundary Problem, where token boundaries distort model outputs for prompts ending in spaces or for languages and code where tokens do not align with natural boundaries. The same conversion unifies vocabularies across models with different tokenizers. As a result, models can be ensembled at inference time or have post-training transferred between them through proxy-tuning. The approach is designed to run efficiently while keeping the original conditional distributions intact.
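To make the boundary effect concrete, here is a minimal sketch using a toy BPE encoder. The merge table and the helper name `bpe_encode` are illustrative assumptions, not the tokenizer of any real model. The only point is that the tokens of a prompt that stops mid-word need not be a prefix of the tokens of the full text.

```python
def bpe_encode(text, merges):
    """Toy BPE: start from single characters and apply merges in priority order."""
    tokens = list(text)
    for left, right in merges:
        merged, i = [], 0
        while i < len(tokens):
            # Merge an adjacent (left, right) pair into one token.
            if i + 1 < len(tokens) and tokens[i] == left and tokens[i + 1] == right:
                merged.append(left + right)
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged
    return tokens


# Hypothetical merge table under which "test" ends up as a single token.
MERGES = [("t", "e"), ("s", "t"), ("te", "st")]

full = bpe_encode("test", MERGES)    # tokens of the complete word
prompt = bpe_encode("tes", MERGES)   # tokens of a prompt that stops mid-word

# The prompt's tokens are not a prefix of the full text's tokens.
is_prefix = full[:len(prompt)] == prompt
print(full, prompt, is_prefix)
```

In this toy setup, a model conditioned on the prompt tokens can only spell "test" through a token sequence that the encoder itself would never produce for that word, which is the kind of distortion the summary attributes to the Prompt Boundary Problem.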

Core claim

We present an inference-time method to convert any autoregressive LM with a BPE tokenizer into a character-level or byte-level LM. Our method efficiently solves the PBP and is also able to unify the vocabularies of language models with different tokenizers, allowing one to ensemble LMs with different tokenizers at inference time or transfer the post-training from one model to another using proxy-tuning.

What carries the argument

The byte-level sampling conversion, which reparameterizes token probabilities so that the model can be sampled one byte at a time while exactly matching the original autoregressive distributions over text.

If this is right

  • The prompt boundary problem is solved for any prompt, including prompts that end in a space and prompts written in code or Chinese text.
  • Language models with incompatible tokenizers can be combined into ensembles at inference time.
  • Post-training performed on one model can be transferred to another via proxy-tuning without retraining the target.
  • The original conditional probability distributions over sequences remain unchanged by the conversion.
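Once every model exposes a next-byte distribution, combining models reduces to arithmetic over a shared 256-symbol alphabet. The sketch below is an illustration under stated assumptions, not the paper's implementation: distributions are plain dicts from byte value to probability, `ensemble_average` is a simple uniform mixture, and `proxy_tune` assumes the usual proxy-tuning rule of adding the log-probability difference between a tuned and an untuned small model to the base model's scores before renormalizing. Both function names and the `eps` smoothing constant are hypothetical.

```python
import math


def ensemble_average(dists):
    """Uniform mixture of several next-byte distributions (dicts: byte -> prob)."""
    support = set().union(*dists)
    return {b: sum(d.get(b, 0.0) for d in dists) / len(dists) for b in support}


def proxy_tune(base, expert, anti_expert, eps=1e-12):
    """Shift the base model's log-probs by (expert - anti_expert), then renormalize.

    eps is an assumed smoothing constant that avoids log(0) for missing bytes.
    """
    support = set(base) | set(expert) | set(anti_expert)
    scores = {
        b: math.log(base.get(b, 0.0) + eps)
        + math.log(expert.get(b, 0.0) + eps)
        - math.log(anti_expert.get(b, 0.0) + eps)
        for b in support
    }
    top = max(scores.values())  # subtract the max for numerical stability
    weights = {b: math.exp(s - top) for b, s in scores.items()}
    total = sum(weights.values())
    return {b: w / total for b, w in weights.items()}


# Toy next-byte distributions over the bytes for "a" (97) and "b" (98).
model_a = {97: 0.5, 98: 0.5}
model_b = {97: 1.0}
print(ensemble_average([model_a, model_b]))
```

When the tuned and untuned small models agree, `proxy_tune` leaves the base distribution essentially unchanged, which is a quick sanity check on the rule.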

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The technique may improve output consistency in multilingual and programming contexts where token boundaries rarely match linguistic or syntactic units.
  • Similar reparameterization ideas could be tested on tokenizers other than BPE to broaden applicability.
  • Targeted experiments on edge-case prompts in Chinese or code would directly measure whether tokenization distortions disappear.

Load-bearing premise

Byte-level sampling can be performed at inference time while preserving the original model's exact conditional distributions over byte sequences and without prohibitive slowdown.

What would settle it

A side-by-side comparison in which the probability of a given byte sequence differs between the original model and the converted byte sampler on the same prompt, or where inference latency increases substantially.
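A comparison of this kind could be scripted along the following lines. This is a hedged sketch: `sequence_logprob` and `agree` are hypothetical helpers, and each model is assumed to be wrapped as a function from the byte history to a dict of next-byte probabilities. It shows only the bookkeeping (chain rule over bytes, then a tolerance check), not how either model is obtained.

```python
import math
from typing import Callable, Dict

# Assumed interface: byte history in, next-byte distribution out.
NextByteFn = Callable[[bytes], Dict[int, float]]


def sequence_logprob(next_byte: NextByteFn, prompt: bytes, continuation: bytes) -> float:
    """Sum of log P(byte | history) over the continuation (chain rule)."""
    total = 0.0
    history = prompt
    for b in continuation:
        p = next_byte(history).get(b, 0.0)
        if p == 0.0:
            return float("-inf")
        total += math.log(p)
        history += bytes([b])
    return total


def agree(reference: NextByteFn, candidate: NextByteFn,
          prompt: bytes, continuation: bytes, tol: float = 1e-6) -> bool:
    """True if both models assign the continuation the same log-probability within tol."""
    ref = sequence_logprob(reference, prompt, continuation)
    cand = sequence_logprob(candidate, prompt, continuation)
    if math.isinf(ref) or math.isinf(cand):
        return ref == cand
    return abs(ref - cand) <= tol


# Toy stand-ins for the two models being compared.
uniform = lambda history: {97: 0.5, 98: 0.5}
skewed = lambda history: {97: 0.9, 98: 0.1}
print(agree(uniform, uniform, b"", b"ab"), agree(uniform, skewed, b"", b"ab"))
```

A latency comparison would need separate timing code and is not covered by this sketch.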

Figures

Figures reproduced from arXiv: 2506.14123 by Alisa Liu, Jonathan Hayase, Noah A. Smith, Sewoong Oh.

Figure 1
Figure 1: ByteSampler resolves the prompt boundary problem (exhibited in the output of generate()). In this example, test, 都是, and .getElementById are all single tokens in the respective tokenizers. The Prompt Boundary Problem (PBP). In particular, Eq. (1) introduces distortion whenever the prompt ends on a prefix of what could otherwise be a single token. More concretely, consider LLAMA-3.2-1B and suppose the use…
Figure 2
Figure 2: Construction of the Valid Covering Tree for string prefix “hypot”: (a) starting with the infinite tree of all possible token sequences (many edges not shown), we prune branches that (b) do not match the given prefix or begin after the prefix ends or (c) contain invalid contiguous pairs of tokens. More example trees are shown in Appendix E. The tree depicted in Fig. 2b corresponds to the cover described in …
Figure 3
Figure 3: Example of valid and invalid token pairs. We show the initial string’s bytes and the merges …
Figure 4
Figure 4: Example Valid Cover Tree for prefix “this is a tes” with the OLM …
Figure 6
Figure 6: Example Valid Cover Tree for prefix “BPE Tokenizatio” with the OLM …
Figure 7
Figure 7: Example Valid Cover Tree for prefix “inductive hypothe” with the OLM …

Original abstract

Tokenization is used almost universally by modern language models, enabling efficient text representation using multi-byte or multi-character tokens. However, prior work has shown that tokenization can introduce distortion into the model's generations, an issue known as the Prompt Boundary Problem (PBP). For example, users are often advised not to end their prompts with a space because it prevents the model from including the space as part of the next token. While this heuristic is effective in English, the underlying PBP continues to affect code generation and languages such as Chinese, where tokens often do not line up with word and syntactic boundaries. In this work, we present an inference-time method to convert any autoregressive LM with a BPE tokenizer into a character-level or byte-level LM. Our method efficiently solves the PBP and is also able to unify the vocabularies of language models with different tokenizers, allowing one to ensemble LMs with different tokenizers at inference time or transfer the post-training from one model to another using proxy-tuning. Code is available at https://github.com/SewoongLab/byte-sampler .

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 0 minor

Summary. The paper claims to present an inference-time method that converts any autoregressive language model using a BPE tokenizer into a byte-level or character-level model. This approach is said to efficiently solve the Prompt Boundary Problem (PBP) and to unify vocabularies of models with different tokenizers, enabling ensembling at inference time or proxy-tuning for post-training transfer.

Significance. If the byte-level sampling exactly preserves the original model's conditional distributions, the work would provide a practical fix for PBP that impacts code generation and languages like Chinese, where token boundaries do not align with syntactic units. It would also facilitate cross-tokenizer model combinations without retraining. The open availability of code is a strength for verification and extension.

major comments (1)
  1. The central technical claim requires that the byte-level sampler computes the exact marginal probability P(next byte | history) by summing over all tokens whose byte prefix matches the current partial token state. The skeptic's concern is that any mismatch in this marginalization or failure to track ongoing tokenizations would cause the generated distribution to diverge from the original model. Please provide the derivation or pseudocode (e.g., in the algorithm description) showing how the trie or DP maintains exact equivalence classes without approximation.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their careful review and for recognizing the potential impact of our inference-time byte-level sampling approach on the Prompt Boundary Problem and cross-tokenizer unification. We address the major technical comment below and have revised the manuscript to provide the requested clarification.

Point-by-point responses
  1. Referee: The central technical claim requires that the byte-level sampler computes the exact marginal probability P(next byte | history) by summing over all tokens whose byte prefix matches the current partial token state. The skeptic's concern is that any mismatch in this marginalization or failure to track ongoing tokenizations would cause the generated distribution to diverge from the original model. Please provide the derivation or pseudocode (e.g., in the algorithm description) showing how the trie or DP maintains exact equivalence classes without approximation.

    Authors: We appreciate the referee's emphasis on verifying exact equivalence. Our byte sampler maintains a trie over the BPE vocabulary and uses dynamic programming to track the set of all active token prefixes consistent with the observed byte history. At each step, the probability of the next byte b is computed exactly as the sum of the model's token probabilities for every vocabulary token whose byte string begins with the current prefix plus b, divided by the total probability mass of all tokens consistent with the current prefix. This marginalization is performed without approximation or sampling and preserves the original conditional distribution over byte sequences by construction. We have added a formal derivation in Section 3.2 together with explicit pseudocode (Algorithm 1) in the revised manuscript that illustrates the equivalence-class maintenance.

    Revision: yes
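The marginalization described in the response can be illustrated with a toy sketch. It assumes a single fixed next-token distribution, given as a dict from token byte strings to probabilities (`token_probs`, a hypothetical input), and it only considers tokens strictly longer than the current prefix. The case where a token ends exactly at the prefix and the next byte starts a new token is omitted here and would need to be handled by a full implementation, so this is not the paper's algorithm.

```python
from collections import defaultdict


def next_byte_distribution(token_probs, prefix):
    """Marginalize token probabilities into a distribution over the next byte.

    token_probs: dict mapping token byte strings to probabilities (assumed input).
    prefix: bytes already emitted within the current token.
    """
    mass = defaultdict(float)
    for token, p in token_probs.items():
        # Keep tokens that extend the prefix by at least one byte.
        if len(token) > len(prefix) and token.startswith(prefix):
            mass[token[len(prefix)]] += p  # indexing bytes yields an int byte value
    total = sum(mass.values())
    if total == 0.0:
        raise ValueError("no token extends the given prefix")
    # Renormalize by the mass of all tokens consistent with the prefix.
    return {b: m / total for b, m in mass.items()}


# Toy vocabulary: after emitting b"te", only b"test" and b"tea" can continue.
toy_probs = {b"test": 0.5, b"tea": 0.2, b"te": 0.1, b"s": 0.2}
dist = next_byte_distribution(toy_probs, b"te")
print({chr(b): round(p, 3) for b, p in dist.items()})
```

The dict scan is linear in the vocabulary size; a trie, as mentioned in the response, would let the same sums be looked up by walking the prefix instead.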

Circularity Check

0 steps flagged

No circularity: new algorithmic conversion technique with no self-referential derivations or fitted predictions

Full rationale

The paper presents an inference-time algorithmic method to convert BPE-tokenized LMs to byte-level sampling, solving the Prompt Boundary Problem and enabling vocabulary unification for ensembling or proxy-tuning. The abstract and described claims introduce a new technique without equations, parameter fits, or derivations that reduce to inputs by construction. No self-citation load-bearing steps, uniqueness theorems, or ansatzes are invoked in the provided text to justify core results. The central claim rests on the correctness of the marginalization implementation rather than any definitional equivalence or renamed empirical pattern. This is a standard non-finding for a methods paper whose contribution is procedural rather than deductive.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The approach rests on standard assumptions about autoregressive factorization and the ability to re-express token probabilities over bytes; no new free parameters or invented entities are introduced in the abstract.

axioms (1)
  • Domain assumption: autoregressive language models factorize probability over tokens, and that factorization can be re-expressed at the byte level.
    Implicit in the claim that any BPE model can be converted to byte-level sampling.
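
One way to write this assumption down is via the chain rule. The notation below is an assumption of this sketch rather than the paper's: t_i are tokens, b_j are bytes, and P(b_{1:j}) denotes the probability that the generated text begins with the bytes b_1 ... b_j, with P(b_{1:0}) = 1.

```latex
% Token-level autoregressive factorization.
P(t_1, \dots, t_n) = \prod_{i=1}^{n} P(t_i \mid t_{<i})

% Byte-level conditionals defined from byte-prefix probabilities.
P(b_j \mid b_{<j}) = \frac{P(b_{1:j})}{P(b_{1:j-1})}

% Multiplying the conditionals recovers the byte-prefix probability.
P(b_{1:m}) = \prod_{j=1}^{m} P(b_j \mid b_{<j})
```

Under this reading, the conversion is exact whenever the byte-prefix probabilities P(b_{1:j}) are computed from the token-level model without approximation.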

pith-pipeline@v0.9.0 · 5728 in / 1102 out tokens · 26648 ms · 2026-05-19T09:49:27.508854+00:00 · methodology


