Sampling from Your Language Model One Byte at a Time
Pith reviewed 2026-05-19 09:49 UTC · model grok-4.3
The pith
A new inference-time method converts any BPE language model into a byte-level sampler that eliminates the prompt boundary problem.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We present an inference-time method to convert any autoregressive LM with a BPE tokenizer into a character-level or byte-level LM. Our method efficiently solves the PBP and is also able to unify the vocabularies of language models with different tokenizers, allowing one to ensemble LMs with different tokenizers at inference time or transfer the post-training from one model to another using proxy-tuning.
What carries the argument
The byte-level sampling conversion, which reparameterizes token probabilities so that the model can be sampled one byte at a time while exactly matching the original autoregressive distributions over text.
If this is right
- The prompt boundary problem is solved for any prompt, including those ending in spaces and for code or Chinese text.
- Language models with incompatible tokenizers can be combined into ensembles at inference time.
- Post-training performed on one model can be transferred to another via proxy-tuning without retraining the target.
- The original conditional probability distributions over sequences remain unchanged by the conversion.
Where Pith is reading between the lines
- The technique may improve output consistency in multilingual and programming contexts where token boundaries rarely match linguistic or syntactic units.
- Similar reparameterization ideas could be tested on tokenizers other than BPE to broaden applicability.
- Targeted experiments on edge-case prompts in Chinese or code would directly measure whether tokenization distortions disappear.
Load-bearing premise
Byte-level sampling can be performed at inference time while preserving the original model's exact conditional distributions over byte sequences and without prohibitive slowdown.
What would settle it
A side-by-side comparison in which the probability of a given byte sequence differs between the original model and the converted byte sampler on the same prompt, or where inference latency increases substantially.
Figures
read the original abstract
Tokenization is used almost universally by modern language models, enabling efficient text representation using multi-byte or multi-character tokens. However, prior work has shown that tokenization can introduce distortion into the model's generations, an issue known as the Prompt Boundary Problem (PBP). For example, users are often advised not to end their prompts with a space because it prevents the model from including the space as part of the next token. While this heuristic is effective in English, the underlying PBP continues to affect code generation and languages such as Chinese, where tokens often do not line up with word and syntactic boundaries. In this work, we present an inference-time method to convert any autoregressive LM with a BPE tokenizer into a character-level or byte-level LM. Our method efficiently solves the PBP and is also able to unify the vocabularies of language models with different tokenizers, allowing one to ensemble LMs with different tokenizers at inference time or transfer the post-training from one model to another using proxy-tuning. Code is available at https://github.com/SewoongLab/byte-sampler .
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims to present an inference-time method that converts any autoregressive language model using a BPE tokenizer into a byte-level or character-level model. This approach is said to efficiently solve the Prompt Boundary Problem (PBP) and to unify vocabularies of models with different tokenizers, enabling ensembling at inference time or proxy-tuning for post-training transfer.
Significance. If the byte-level sampling exactly preserves the original model's conditional distributions, the work would provide a practical fix for PBP that impacts code generation and languages like Chinese, where token boundaries do not align with syntactic units. It would also facilitate cross-tokenizer model combinations without retraining. The open availability of code is a strength for verification and extension.
major comments (1)
- The central technical claim requires that the byte-level sampler computes the exact marginal probability P(next byte | history) by summing over all tokens whose byte prefix matches the current partial token state. The skeptic's concern is that any mismatch in this marginalization or failure to track ongoing tokenizations would cause the generated distribution to diverge from the original model. Please provide the derivation or pseudocode (e.g., in the algorithm description) showing how the trie or DP maintains exact equivalence classes without approximation.
Simulated Author's Rebuttal
We thank the referee for their careful review and for recognizing the potential impact of our inference-time byte-level sampling approach on the Prompt Boundary Problem and cross-tokenizer unification. We address the major technical comment below and have revised the manuscript to provide the requested clarification.
read point-by-point responses
-
Referee: The central technical claim requires that the byte-level sampler computes the exact marginal probability P(next byte | history) by summing over all tokens whose byte prefix matches the current partial token state. The skeptic's concern is that any mismatch in this marginalization or failure to track ongoing tokenizations would cause the generated distribution to diverge from the original model. Please provide the derivation or pseudocode (e.g., in the algorithm description) showing how the trie or DP maintains exact equivalence classes without approximation.
Authors: We appreciate the referee's emphasis on verifying exact equivalence. Our byte sampler maintains a trie over the BPE vocabulary and uses dynamic programming to track the set of all active token prefixes consistent with the observed byte history. At each step the probability of the next byte b is computed exactly as the sum of the model's token probabilities for every vocabulary token whose byte string begins with the current prefix plus b, divided by the total probability mass of all tokens consistent with the current prefix. This marginalization is performed without approximation or sampling and preserves the original conditional distribution over byte sequences by construction. We have added a formal derivation in Section 3.2 together with explicit pseudocode (Algorithm 1) in the revised manuscript that illustrates the equivalence-class maintenance. revision: yes
Circularity Check
No circularity: new algorithmic conversion technique with no self-referential derivations or fitted predictions
full rationale
The paper presents an inference-time algorithmic method to convert BPE-tokenized LMs to byte-level sampling, solving the Prompt Boundary Problem and enabling vocabulary unification for ensembling or proxy-tuning. The abstract and described claims introduce a new technique without equations, parameter fits, or derivations that reduce to inputs by construction. No self-citation load-bearing steps, uniqueness theorems, or ansatzes are invoked in the provided text to justify core results. The central claim rests on the correctness of the marginalization implementation rather than any definitional equivalence or renamed empirical pattern. This is a standard non-finding for a methods paper whose contribution is procedural rather than deductive.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Autoregressive language models factorize probability over tokens that can be re-expressed at the byte level
Lean theorems connected to this paper
-
IndisputableMonolith/Foundation/AbsoluteFloorClosure.leanreality_from_one_distinction unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
We introduce an efficient procedure to condition a BPE tokenizer-based model on an arbitrary byte-prefix... using the Valid Covering Tree... pairwise validation... Proposition 3.1
-
IndisputableMonolith/Cost/FunctionalEquation.leanwashburn_uniqueness_aczel unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
The tree represents exactly the set of valid sequences of tokens with the prompt as a prefix... bounded depth... constant time updates
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
-
[1]
O. Ahia, S. Kumar, H. Gonen, V . Hofmann, T. Limisiewicz, Y . Tsvetkov, and N. A. Smith. MAGNET: Improving the multilingual fairness of language models with adaptive gradient- based tokenization. In The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024. URL https://openreview.net/forum?id=1e3MOwHSIX
work page 2024
-
[2]
B. Athiwaratkun, S. Wang, M. Shang, Y . Tian, Z. Wang, S. K. Gonugondla, S. K. Gouda, R. Kwiatkowski, R. Nallapati, P. Bhatia, and B. Xiang. Token alignment via character matching for subword completion. In L.-W. Ku, A. Martins, and V . Srikumar, editors,Findings of the Association for Computational Linguistics: ACL 2024, pages 15725–15738, Bangkok, Thail...
-
[3]
M. Berglund and B. van der Merwe. Formalizing bpe tokenization. In 13th International Work- shop on Non-Classical Models of Automata and Applications, NCMA 2023, 18-19 September, 2023, Famagusta, Cyprus, pages 16–27. Open Publishing Association, 2023
work page 2023
-
[4]
Beyond the imitation game: Quantifying and extrapolating the capabilities of language models
BIG-bench. Beyond the imitation game: Quantifying and extrapolating the capabilities of language models. Transactions on Machine Learning Research, 2023. ISSN 2835-8856. URL https://openreview.net/forum?id=uyTL5Bvosj
work page 2023
-
[5]
T. B. Brown, B. Mann, N. Ryder, M. Subbiah, J. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, S. Agarwal, A. Herbert-V oss, G. Krueger, T. Henighan, R. Child, A. Ramesh, D. M. Ziegler, J. Wu, C. Winter, C. Hesse, M. Chen, E. Sigler, M. Litwin, S. Gray, B. Chess, J. Clark, C. Berner, S. McCandlish, A. Radford, I. Sutskever, and D. Amo...
work page 1901
- [6]
-
[7]
Y . Chen, K. Marchisio, R. Raileanu, D. Adelani, P. Stenetorp, S. Riedel, and M. Artetxe. Improving language plasticity via pretraining with active forgetting. In Advances in Neural Information Processing Systems. NeurIPS, 2023
work page 2023
-
[8]
Z. Chen, J. Li, P. Chen, Z. Li, K. Sun, Y . Luo, Q. Mao, D. Yang, H. Sun, and P. S. Yu. Harnessing multiple large language models: A survey on llm ensemble. arXiv preprint arXiv:2502.18036, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[9]
N. Chirkova, G. Kruszewski, J. Rozen, and M. Dymetman. Should you marginalize over possible tokenizations? In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 1–12, 2023
work page 2023
-
[10]
P. Chizhov, C. Arnett, E. Korotkova, and I. P. Yamshchikov. Bpe gets picky: Efficient vocabulary refinement during tokenizer training. arXiv preprint arXiv:2409.04599, 2024
- [11]
-
[12]
J. H. Clark, D. Garrette, I. Turc, and J. Wieting. Canine: Pre-training an efficient tokenization- free encoder for language representation. Transactions of the Association for Computational Linguistics, 10:73–91, 2022. doi: 10.1162/tacl_a_00448. URL https://aclanthology. org/2022.tacl-1.5
- [13]
-
[14]
K. Dobler and G. De Melo. Focus: Effective embedding initialization for monolingual special- ization of multilingual models. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 13440–13454, 2023
work page 2023
-
[15]
D. Dua, Y . Wang, P. Dasigi, G. Stanovsky, S. Singh, and M. Gardner. DROP: A reading comprehension benchmark requiring discrete reasoning over paragraphs. In J. Burstein, C. Doran, and T. Solorio, editors, Proceedings of the 2019 Conference of the North Amer- ican Chapter of the Association for Computational Linguistics: Human Language Tech- nologies, Vol...
-
[16]
P. Gage. A new algorithm for data compression. The C Users Journal archive, 12:23–38, 1994. URLhttps://api.semanticscholar.org/CorpusID:59804030
work page 1994
-
[17]
L. Gee, A. Zugarini, L. Rigutini, P. Torroni, et al. Fast vocabulary transfer for language model compression. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing: Industry Track, pages 409–416. Association for Computational Linguistics (ACL), 2022
work page 2022
-
[18]
L. Gee, L. Rigutini, M. Ernandes, and A. Zugarini. Multi-word tokenization for sequence compression. In M. Wang and I. Zitouni, editors, Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing: Industry Track, pages 612–621, Singapore, Dec. 2023. Association for Computational Linguistics. doi: 10.18653/v1/2023.emnlp-industry
-
[19]
URL https://aclanthology.org/2023.emnlp-industry.58
work page 2023
- [20]
-
[21]
A. Gera, R. Friedman, O. Arviv, C. Gunasekara, B. Sznajder, N. Slonim, and E. Shnarch. The benefits of bad advice: Autocontrastive decoding across model layers. In A. Rogers, J. Boyd- Graber, and N. Okazaki, editors, Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 10406–10420, Toronto,...
-
[22]
A. Holtzman, J. Buys, L. Du, M. Forbes, and Y . Choi. The curious case of neural text degeneration. In International Conference on Learning Representations, 2020
work page 2020
- [23]
-
[24]
J. Jackson. Character prefix conditioning, 2025. URL https://www.cursor.com/blog/cpc
work page 2025
-
[25]
TriviaQA: A large scale distantly supervised challenge dataset for reading comprehension
M. Joshi, E. Choi, D. Weld, and L. Zettlemoyer. TriviaQA: A large scale distantly supervised challenge dataset for reading comprehension. In R. Barzilay and M.-Y . Kan, editors,Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1601–1611, Vancouver, Canada, July 2017. Association for Comp...
-
[26]
J. Kasai, K. Sakaguchi, R. Le Bras, H. Peng, X. Lu, D. Radev, Y . Choi, and N. A. Smith. Twist decoding: Diverse generators guide each other. In Y . Goldberg, Z. Kozareva, and Y . Zhang, editors, Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 4909–4923, Abu Dhabi, United Arab Emirates, Dec. 2022. Association ...
-
[27]
T. Kudo. Subword regularization: Improving neural network translation models with multi- ple subword candidates. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 66–75, 2018
work page 2018
-
[28]
T. Kudo and J. Richardson. Sentencepiece: A simple and language independent subword tokenizer and detokenizer for neural text processing. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 66–71, 2018
work page 2018
-
[29]
S. Kudugunta, I. Caswell, B. Zhang, X. Garcia, D. Xin, A. Kusupati, R. Stella, A. Bapna, and O. Firat. Madlad-400: A multilingual and document-level large audited dataset. Advances in Neural Information Processing Systems, 36:67284–67296, 2023
work page 2023
-
[30]
D. Kumar and A. Thawani. BPE beyond word boundary: How NOT to use multi word expres- sions in neural machine translation. In S. Tafreshi, J. Sedoc, A. Rogers, A. Drozd, A. Rumshisky, and A. Akula, editors, Proceedings of the Third Workshop on Insights from Negative Results in NLP, pages 172–179, Dublin, Ireland, May 2022. Association for Computational Lin...
-
[31]
N. Lambert, J. Morrison, V . Pyatkin, S. Huang, H. Ivison, F. Brahman, L. J. V . Miranda, A. Liu, N. Dziri, S. Lyu, Y . Gu, S. Malik, V . Graf, J. D. Hwang, J. Yang, R. L. Bras, O. Tafjord, C. Wilhelm, L. Soldaini, N. A. Smith, Y . Wang, P. Dasigi, and H. Hajishirzi. Tulu 3: Pushing frontiers in open language model post-training, 2025. URL https://arxiv.o...
work page 2025
-
[32]
X. L. Li, A. Holtzman, D. Fried, P. Liang, J. Eisner, T. Hashimoto, L. Zettlemoyer, and M. Lewis. Contrastive decoding: Open-ended text generation as optimization. In A. Rogers, J. Boyd- Graber, and N. Okazaki, editors, Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 12286–12312, Toron...
-
[33]
A. Liu, M. Sap, X. Lu, S. Swayamdipta, C. Bhagavatula, N. A. Smith, and Y . Choi. DEx- perts: Decoding-time controlled text generation with experts and anti-experts. In C. Zong, F. Xia, W. Li, and R. Navigli, editors, Proceedings of the 59th Annual Meeting of the As- sociation for Computational Linguistics and the 11th International Joint Conference on Na...
-
[34]
doi: 10.18653/v1/2021.acl-long.522
Association for Computational Linguistics. doi: 10.18653/v1/2021.acl-long.522. URL https://aclanthology.org/2021.acl-long.522. 12
-
[35]
A. Liu, X. Han, Y . Wang, Y . Tsvetkov, Y . Choi, and N. A. Smith. Tuning language models by proxy. In First Conference on Language Modeling, 2024
work page 2024
- [36]
- [37]
-
[38]
Y . Liu, P. Lin, M. Wang, and H. Schütze. Ofa: A framework of initializing unseen subword embeddings for efficient large-scale multilingual continued pretraining. In Findings of the Association for Computational Linguistics: NAACL 2024, pages 1067–1097, 2024
work page 2024
- [39]
- [40]
-
[41]
K. Marchisio, P. Lewis, Y . Chen, and M. Artetxe. Mini-model adaptation: Efficiently extending pretrained models to new languages via aligned shallow training. In The 61st Annual Meeting Of The Association For Computational Linguistics, 2023
work page 2023
-
[42]
C. Mavromatis, P. Karypis, and G. Karypis. Pack of llms: Model fusion at test-time via perplexity optimization. In First Conference on Language Modeling, 2024
work page 2024
-
[43]
S. J. Mielke. Can you compare perplexity across different segmentations?, Apr 2019. URL https://sjmielke.com/comparing-perplexities.htm
work page 2019
-
[44]
B. Minixhofer, F. Paischer, and N. Rekabsaz. Wechsel: Effective initialization of subword embeddings for cross-lingual transfer of monolingual language models. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 3992–4006, 2022
work page 2022
-
[45]
B. Minixhofer, E. Ponti, and I. Vuli´c. Zero-shot tokenizer transfer. In The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024
work page 2024
-
[46]
B. Minixhofer, I. Vuli´c, and E. M. Ponti. Universal cross-tokenizer distillation via approximate likelihood matching. arXiv preprint arXiv:2503.20083, 2025
-
[47]
P. Nawrot, J. Chorowski, A. Lancucki, and E. M. Ponti. Efficient transformers with dynamic token pooling. In A. Rogers, J. Boyd-Graber, and N. Okazaki, editors, Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages 6403–6417, Toronto, Canada, July 2023. Association for Computational Linguis...
-
[48]
B.-D. Oh and W. Schuler. Leading whitespaces of language models’ subword vocabulary pose a confound for calculating word probabilities. In Y . Al-Onaizan, M. Bansal, and Y .-N. Chen, editors, Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 3464–3472, Miami, Florida, USA, Nov. 2024. Association for Computationa...
-
[49]
T. OLMo, P. Walsh, L. Soldaini, D. Groeneveld, K. Lo, S. Arora, A. Bhagia, Y . Gu, S. Huang, M. Jordan, N. Lambert, D. Schwenk, O. Tafjord, T. Anderson, D. Atkinson, F. Brahman, C. Clark, P. Dasigi, N. Dziri, M. Guerquin, H. Ivison, P. W. Koh, J. Liu, S. Malik, W. Merrill, L. J. V . Miranda, J. Morrison, T. Murray, C. Nam, V . Pyatkin, A. Rangapur, M. Sch...
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[50]
Openai platform documentation, 2023
OpenAI. Openai platform documentation, 2023. URL https://platform.openai.com/ docs. Accessed: 2025/05/10. 13
work page 2023
-
[51]
A. Pagnoni, R. Pasunuru, P. Rodriguez, J. Nguyen, B. Muller, M. Li, C. Zhou, L. Yu, J. Weston, L. Zettlemoyer, G. Ghosh, M. Lewis, A. Holtzman, and S. Iyer. Byte latent transformer: Patches scale better than tokens, 2024. URL https://arxiv.org/abs/2412.09871
-
[52]
D. Paperno, G. Kruszewski, A. Lazaridou, N. Q. Pham, R. Bernardi, S. Pezzelle, M. Baroni, G. Boleda, and R. Fernández. The LAMBADA dataset: Word prediction requiring a broad discourse context. In K. Erk and N. A. Smith, editors, Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1525–1534...
-
[53]
URL https://aclanthology.org/P16-1144
- [54]
-
[55]
T. Pimentel and C. Meister. How to compute the probability of a word. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 18358–18375, 2024
work page 2024
-
[56]
I. Provilkov, D. Emelianenko, and E. V oita. Bpe-dropout: Simple and effective subword regularization. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, 2020
work page 2020
-
[57]
P. Rajpurkar, J. Zhang, K. Lopyrev, and P. Liang. SQuAD: 100,000+ questions for machine comprehension of text. In J. Su, K. Duh, and X. Carreras, editors, Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 2383–2392, Austin, Texas, Nov. 2016. Association for Computational Linguistics. doi: 10.18653/v1/D16-1264. U...
-
[58]
M. T. Ribeiro. A guidance language for controlling large language models, 2023. URL https: //github.com/guidance-ai/guidance?tab=readme-ov-file#text-not-tokens
work page 2023
-
[59]
R. S. 4d masks support in transformers, 2024. URL https://huggingface.co/blog/ poedator/4d-masks
work page 2024
-
[60]
M. Schuster and K. Nakajima. Japanese and korean voice search. In 2012 IEEE international conference on acoustics, speech and signal processing (ICASSP) , pages 5149–5152. IEEE, 2012
work page 2012
-
[61]
R. Sennrich, B. Haddow, and A. Birch. Neural machine translation of rare words with subword units. In K. Erk and N. A. Smith, editors, Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1715–1725, Berlin, Germany, Aug. 2016. Association for Computational Linguistics. doi: 10.18653/v1/P16-...
-
[62]
R. Shi, Y . Chen, Y . Hu, A. Liu, H. Hajishirzi, N. A. Smith, and S. S. Du. Decoding-time language model alignment with multiple objectives. In The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024
work page 2024
-
[63]
W. Shi, X. Han, M. Lewis, Y . Tsvetkov, L. Zettlemoyer, and W.-t. Yih. Trusting your evidence: Hallucinate less with context-aware decoding. In K. Duh, H. Gomez, and S. Bethard, editors, Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 2: Short Papers), p...
-
[64]
Y . Tay, V . Q. Tran, S. Ruder, J. Gupta, H. W. Chung, D. Bahri, Z. Qin, S. Baumgartner, C. Yu, and D. Metzler. Charformer: Fast character transformers via gradient-based subword tokenization. In International Conference on Learning Representations, 2022. URL https: //openreview.net/forum?id=JtBRnrlOEFN
work page 2022
-
[65]
L. Team. The llama 3 herd of models, 2024. URL https://arxiv.org/abs/2407.21783. 14
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[66]
L. Team. Introducing llama 3.1: Our most capable models to date, 2024. URL https: //ai.meta.com/blog/meta-llama-3-1/
work page 2024
-
[67]
L. Team. Llama 3.2: Revolutionizing edge ai and vision with open, customizable models, 2024. URLhttps://ai.meta.com/blog/llama-3-2-connect-2024-vision-edge-mobile- devices/. Accessed: 2025/05/10
work page 2024
-
[68]
O. Team. Olmo release notes, 2025. URL https://allenai.org/olmo/release-notes# olmo-2-1b. Accessed: 2025/05/10
work page 2025
-
[69]
Q. Team. Qwen3: Think deeper, act faster, 2025. URL https://qwenlm.github.io/blog/ qwen3/. Accessed: 2025/05/10
work page 2025
- [70]
-
[71]
B. Tunguz. 200,000+ jeopardy! questions, 1019. URL https://www.kaggle.com/ datasets/tunguz/200000-jeopardy-questions
-
[72]
B. Tunguz. 200,000+ jeopardy! questions, 2019. URL https://www.kaggle.com/ datasets/tunguz/200000-jeopardy-questions
work page 2019
-
[73]
A. Turaga. Character prefix conditioning with back tokenization, 2025. URL https:// anilturaga.github.io/cpc
work page 2025
-
[74]
H. van Antwerpen and A. Neubeck. So many tokens, so little time: Introducing a faster, more flexible byte-pair tokenizer, 2025. URL https://github.blog/ai-and-ml/llms/so- many-tokens-so-little-time-introducing-a-faster-more-flexible-byte- pair-tokenizer/. Accessed: 2025/05/10
work page 2025
- [75]
- [76]
-
[77]
J. Wang, T. Gangavarapu, J. N. Yan, and A. M. Rush. Mambabyte: Token-free selective state space model. In First Conference on Language Modeling, 2024. URL https://openreview. net/forum?id=X1xNsuKssb
work page 2024
- [78]
-
[79]
Y . Xu, J. Lu, and J. Zhang. Bridging the gap between different vocabularies for llm ensemble. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pages 7133–7145, 2024
work page 2024
-
[80]
Y . Xu, J. Chen, J. Wu, and J. Zhang. Hit the sweet spot! span-level ensemble for large language models. In Proceedings of the 31st International Conference on Computational Linguistics, pages 8314–8325, 2025
work page 2025
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.