Beyond Perplexity: UTF-8 Validity in Byte-aware Language Models
Pith reviewed 2026-06-27 05:13 UTC · model grok-4.3
The pith
Byte-level language models need twice as much training data for valid UTF-8 output as they do to minimize perplexity.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
UTF-8 validity convergence lags perplexity by a factor of two, with perplexity stabilizing after 2.1B tokens but UTF-8 validity requiring 4.2B tokens; in context-free generation, rare characters achieve higher structural validity than common characters, suggesting over-specialization of frequent character representations.
What carries the argument
Evaluation protocols that isolate UTF-8 structural validity from language modeling performance.
If this is right
- Perplexity scores alone do not guarantee reliable UTF-8 generation in byte-level models.
- Additional training beyond typical perplexity convergence points is required for encoding fidelity.
- Rare characters receive less over-specialization and therefore produce more valid sequences than common characters.
- Byte-level models require separate assessment of structural validity in addition to standard metrics.
Where Pith is reading between the lines
- Dedicated objectives that penalize invalid byte sequences could reduce the observed training lag.
- The rare-versus-common difference may appear in other structural properties such as consistent punctuation or script switching.
- Models trained on less balanced corpora could exhibit even larger gaps between perplexity and validity.
Load-bearing premise
The introduced evaluation protocols successfully isolate UTF-8 structural validity from ordinary language-modeling performance.
What would settle it
A replication experiment with altered model size, sampling procedure, or corpus balance that shows UTF-8 validity converging at the same token count as perplexity.
Figures
read the original abstract
Byte-level tokenization enables language models to handle any Unicode input, but models can generate invalid UTF-8 sequences when encountering rare or unseen characters. We investigate the relationship between training scale and UTF-8 generation reliability with a 355M parameter model trained on 80B tokens from a balanced multilingual corpus of English, Japanese, Korean, and Chinese. We introduce multiple evaluation protocols that isolate UTF-8 structural validity from language modeling. UTF-8 validity convergence lags perplexity by a roughly a factor of two: perplexity stabilizes after 2.1B tokens, but UTF-8 validity requires 4.2B tokens. In context-free generation, rare characters achieve higher structural validity than common characters, suggesting over-specialization of frequent character representations. Through experiments, we observed that reliable UTF-8 generation is a distinct capability requiring evaluation beyond perplexity.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that byte-level LMs can produce invalid UTF-8 despite handling any Unicode input, and that with a 355M model trained on 80B tokens from a balanced multilingual corpus (English, Japanese, Korean, Chinese), UTF-8 validity convergence lags perplexity by a factor of roughly two (perplexity stabilizes after 2.1B tokens, UTF-8 validity after 4.2B). It further claims that in context-free generation rare characters achieve higher structural validity than common ones, suggesting over-specialization, and that reliable UTF-8 generation is a distinct capability requiring evaluation protocols beyond perplexity.
Significance. If the evaluation protocols truly isolate UTF-8 structural validity from standard LM performance, the result would show that perplexity alone is insufficient to guarantee reliable generation in byte-aware multilingual models and that training scale requirements differ for structural validity. The empirical observation of a lag and the rare-vs-common reversal would be a useful data point for byte-level modeling, though the single 355M model size and 80B-token setup limit immediate generalizability.
major comments (2)
- [Abstract / protocol description] Abstract and the paragraphs describing the protocols: the central claims (factor-of-two lag between 2.1B and 4.2B tokens; rare > common validity) rest on the assertion that the introduced protocols isolate UTF-8 structural validity from ordinary language-modeling performance, yet no definitions are supplied for validity scoring, convergence criteria, statistical tests, run-to-run variance, or data-exclusion rules. Without these, the lag and the over-specialization interpretation cannot be verified and could be artifacts of the particular corpus balance or sampling procedure.
- [Context-free generation experiments] Results on context-free generation: the claim that rare characters achieve higher structural validity than common characters is presented without reporting the prompting method, temperature, context length, or how 'rare' vs 'common' characters were sampled and counted. This leaves open whether the observed difference is driven by the generation procedure rather than a general property of byte-level representations.
minor comments (1)
- [Abstract] The abstract states numerical thresholds (2.1B, 4.2B) without indicating how stabilization was operationalized; a brief parenthetical definition would improve clarity.
Simulated Author's Rebuttal
We thank the referee for the constructive comments on our manuscript. We agree that the evaluation protocols require more explicit definitions and that the context-free generation experiments need additional methodological details to ensure reproducibility. We will incorporate these changes in a major revision.
read point-by-point responses
-
Referee: [Abstract / protocol description] Abstract and the paragraphs describing the protocols: the central claims (factor-of-two lag between 2.1B and 4.2B tokens; rare > common validity) rest on the assertion that the introduced protocols isolate UTF-8 structural validity from ordinary language-modeling performance, yet no definitions are supplied for validity scoring, convergence criteria, statistical tests, run-to-run variance, or data-exclusion rules. Without these, the lag and the over-specialization interpretation cannot be verified and could be artifacts of the particular corpus balance or sampling procedure.
Authors: We agree that the manuscript lacks sufficient detail on these aspects. In the revised version, we will add a dedicated 'Evaluation Protocols' subsection that defines: UTF-8 validity scoring as the fraction of byte sequences that successfully decode to valid UTF-8 (using Python's decode with errors='strict'); convergence as the training step where a 100M-token moving average changes by less than 0.5% over three consecutive checkpoints; statistical significance via paired t-tests (p<0.05) across three independent runs; run-to-run variance reported as standard deviation; and data-exclusion rules (sequences with >5% control bytes are excluded from validity computation). These additions will allow independent verification of the reported lag and rare-vs-common reversal. revision: yes
-
Referee: [Context-free generation experiments] Results on context-free generation: the claim that rare characters achieve higher structural validity than common characters is presented without reporting the prompting method, temperature, context length, or how 'rare' vs 'common' characters were sampled and counted. This leaves open whether the observed difference is driven by the generation procedure rather than a general property of byte-level representations.
Authors: We acknowledge the omission of these details. The revision will expand the relevant section to specify: context-free generation uses an empty prompt (no prefix tokens); temperature=1.0 with top-p=0.95; generation starts from a 0-token context; 'rare' characters are defined as those with frequency <100 in the 80B-token corpus while 'common' are the top 500 by frequency; 5,000 samples per category are drawn uniformly and validity is measured on the first 128 bytes generated. We will also note that the rare>common pattern holds under alternative sampling (e.g., frequency-weighted) and report exact counts. revision: yes
Circularity Check
No significant circularity; claims rest on direct empirical counts.
full rationale
The paper reports direct counts of valid UTF-8 byte sequences during training checkpoints and context-free generation. These measurements do not reduce to any fitted parameter, self-defined quantity, or self-citation chain. The introduced protocols are presented as independent evaluation methods that separate structural validity from perplexity; no equation or definition in the provided text shows the validity metric being constructed from the perplexity values or vice versa. The factor-of-two lag and rare-vs-common observations are therefore empirical findings rather than tautological restatements of inputs.
Axiom & Free-Parameter Ledger
free parameters (2)
- model_size
- training_scale
axioms (1)
- standard math UTF-8 is a well-defined variable-length byte encoding whose structural validity can be checked independently of semantic content.
Reference graph
Works this paper leans on
-
[1]
URL http://arxiv. org/abs/2305.13245. arXiv:2305.13245 [cs]. Cognetta, M. and Okazaki, N. Tokenization as finite-state transduction.Computational Linguistics, 51(4):1119– 1149, December
-
[2]
URL https://aclanthology.org/2025.cl-4.2/
doi: 10.1162/coli.a.23. URL https://aclanthology.org/2025.cl-4.2/. Geh, R. L., Shao, Z., and Broeck, G. V . d. Adversarial Tokenization, June
-
[3]
URLhttp://arxiv.org/abs/ 2503.02174. arXiv:2503.02174 [cs]. Gillick, D., Brunk, C., Vinyals, O., and Subramanya, A. Multilingual language processing from bytes. In Knight, K., Nenkova, A., and Rambow, O. (eds.),Proceedings of the 2016 Conference of the North American Chap- ter of the Association for Computational Linguistics: Human Language Technologies, ...
arXiv 2016
-
[4]
Association for Compu- tational Linguistics. doi: 10.18653/v1/N16-1155. URL https://aclanthology.org/N16-1155/. Hoffmann, J., Borgeaud, S., Mensch, A., Buchatskaya, E., Cai, T., Rutherford, E., Casas, D. d. L., Hendricks, L. A., Welbl, J., Clark, A., Hennigan, T., Noland, E., Millican, K., Driessche, G. v. d., Damoc, B., Guy, A., Osindero, S., Simonyan, K...
-
[5]
URL http://arxiv.org/abs/ 2203.15556. arXiv:2203.15556 [cs]. Kaplan, J., McCandlish, S., Henighan, T., Brown, T. B., Chess, B., Child, R., Gray, S., Radford, A., Wu, J., and Amodei, D. Scaling Laws for Neural Language Mod- els, January
-
[6]
URL http://arxiv.org/abs/2001. 08361. arXiv:2001.08361 [cs]. Koo, T., Liu, F., and He, L. Automata-based constraints for language model decoding. InConference on Language Modeling (COLM),
Pith/arXiv arXiv 2001
-
[7]
URL https://openreview. net/forum?id=BDBdblmyzY. arXiv:2407.08103. Land, S. and Arnett, C. BPE stays on SCRIPT: Struc- tured encoding for robust multilingual pretokeniza- tion. InICML 2025 Workshop on Tokenization (Tok- Shop),
arXiv 2025
-
[8]
URL https://openreview.net/forum? id=AO78CqwaUO. Land, S. and Bartolo, M. Fishing for magikarp: Auto- matically detecting under-trained tokens in large lan- guage models. In Al-Onaizan, Y ., Bansal, M., and Chen, Y .-N. (eds.),Proceedings of the 2024 Conference on Empirical Methods in Natural Language Process- ing, pp. 11631–11646, Miami, Florida, USA, Novem- ber
2024
-
[9]
Fishing for Magikarp: Automatically Detecting Under-trained Tokens in Large Language Models
Association for Computational Linguistics. doi: 10.18653/v1/2024.emnlp-main.649. URL https: //aclanthology.org/2024.emnlp-main.649/. Li, Y ., Liu, Y ., Deng, G., Zhang, Y ., Song, W., Shi, L., Wang, K., Li, Y ., Liu, Y ., and Wang, H. Glitch Tokens in Large Language Models: Categorization Taxonomy and Effective Detection, April
-
[10]
URL http://arxiv. org/abs/2404.09894. arXiv:2404.09894 [cs]. Limisiewicz, T., Blevins, T., Gonen, H., Ahia, O., and Zettlemoyer, L. MYTE: Morphology-driven byte en- coding for better and fairer multilingual language mod- eling. In Ku, L.-W., Martins, A., and Srikumar, V . (eds.),Proceedings of the 62nd Annual Meeting of the Association for Computational L...
-
[11]
doi: 10.18653/v1/2024.acl-long.804
Association for Computational Linguis- tics. doi: 10.18653/v1/2024.acl-long.804. URL https: //aclanthology.org/2024.acl-long.804/. Pagnoni, A., Pasunuru, R., Rodriguez, P., Nguyen, J., Muller, B., Li, M., Zhou, C., Yu, L., Weston, J., Zettle- moyer, L., Ghosh, G., Lewis, M., Holtzman, A., and Iyer, S. Byte Latent Transformer: Patches Scale Better Than Tok...
-
[12]
URLhttp://arxiv.org/abs/ 2412.09871. arXiv:2412.09871 [cs]. Penedo, G., Kydl ´ıˇcek, H., allal, L. B., Lozhkov, A., Mitchell, M., Raffel, C., V on Werra, L., and Wolf, T. The fineweb datasets: Decanting the web for the finest text data at scale. In Globerson, A., Mackey, L., Belgrave, D., Fan, A., Paquet, U., Tom- czak, J., and Zhang, C. (eds.),Advances i...
-
[13]
doi: 10.52202/079017-0970. URL https://proceedings. neurips.cc/paper files/paper/2024/file/ 370df50ccfdf8bde18f8f9c2d9151bda-Paper-Datasets and Benchmarks Track.pdf. 10 Beyond Perplexity: UTF-8 Validity in Byte-aware Language Models Pokharel, R., Nezhad, S. B., Agrawal, A., and Singh, S. The Impact of Model Scaling on Seen and Unseen Language Performance, January
-
[14]
URL http://arxiv.org/ abs/2501.05629. arXiv:2501.05629 [cs]. Rust, P., Pfeiffer, J., Vuli ´c, I., Ruder, S., and Gurevych, I. How good is your tokenizer? on the monolin- gual performance of multilingual language models. In Zong, C., Xia, F., Li, W., and Navigli, R. (eds.),Pro- ceedings of the 59th Annual Meeting of the Associa- tion for Computational Ling...
-
[15]
12 Oleksiy Syvokon and Mariana Romanyshyn
Association for Computational Lin- guistics. doi: 10.18653/v1/2021.acl-long.243. URL https://aclanthology.org/2021.acl-long.243/. Shazeer, N. GLU Variants Improve Transformer, Febru- ary
-
[16]
GLU Variants Improve Transformer
URL http://arxiv.org/abs/2002.05202. arXiv:2002.05202 [cs]. Singh, S., Vargus, F., D’souza, D., Karlsson, B., Mahendi- ran, A., Ko, W.-Y ., Shandilya, H., Patel, J., Mataciunas, D., O’Mahony, L., Zhang, M., Hettiarachchi, R., Wil- son, J., Machado, M., Moura, L., Krzemi´nski, D., Fadaei, H., Ergun, I., Okoh, I., Alaagib, A., Mudannayake, O., Alyafeai, Z.,...
work page internal anchor Pith review Pith/arXiv arXiv 2002
-
[17]
doi: 10.18653/v1/2024.acl-long.620
Association for Computational Linguistics. doi: 10.18653/v1/2024.acl-long.620. URL https://aclanthology.org/2024.acl-long.620/. Su, J., Ahmed, M., Lu, Y ., Pan, S., Bo, W., and Liu, Y . Roformer: Enhanced transformer with rotary position embedding.Neurocomputing, 568:127063,
-
[18]
Szegedy, C., Vanhoucke, V ., Ioffe, S., Shlens, J., and Wojna, Z
doi: 10.1016/j.neucom.2023.127063. URL https://arxiv. org/abs/2104.09864. Team, G., Riviere, M., Pathak, S., Sessa, P. G., Hardin, C., Bhupatiraju, S., Hussenot, L., Mesnard, T., Shahriari, B., Ram´e, A., Ferret, J., Liu, P., Tafti, P., Friesen, A., Casbon, M., Ramos, S., Kumar, R., Lan, C. L., Jerome, S., Tsit- sulin, A., Vieillard, N., Stanczyk, P., Gir...
-
[19]
Gemma 2: Improving Open Language Models at a Practical Size
URL http://arxiv.org/abs/2408.00118. arXiv:2408.00118 [cs]. Wang, C., Cho, K., and Gu, J. Neural machine translation with byte-level subwords. InProceedings of the AAAI conference on artificial intelligence, volume 34, pp. 9154– 9160,
work page internal anchor Pith review Pith/arXiv arXiv
-
[20]
Willard, B. T. and Louf, R. Efficient guided generation for large language models.arXiv preprint arXiv:2307.09702,
work page internal anchor Pith review Pith/arXiv arXiv
-
[21]
Lost in the middle: How language models use long contexts
doi: 10.1162/tacl a 00461. URLhttps://aclanthology.org/2022.tacl-1.17/. Zhang, B. and Sennrich, R. Root Mean Square Layer Nor- malization, October
work page internal anchor Pith review doi:10.1162/tacl 2022
-
[22]
URL http://arxiv.org/ abs/1910.07467. arXiv:1910.07467 [cs]. 11 Beyond Perplexity: UTF-8 Validity in Byte-aware Language Models 0 1000 2000 3000 4000 5000 6000 7000 8000 9.8 9.9 10.0 10.1 10.2T oken Percentage English 0 1000 2000 3000 4000 5000 6000 7000 8000 29.8 30.0 30.2 Japanese 0 1000 2000 3000 4000 5000 6000 7000 8000 Data Batch 29.8 30.0 30.2T oken...
Pith/arXiv arXiv 1910
-
[23]
The feed-forward blocks use a gated MLP (GeGLU variant) (Shazeer, 2020)
with a reduced number of key/value heads to lower memory usage (similar in spirit to recent implementations such as Gemma 2 (Team et al., 2024)). The feed-forward blocks use a gated MLP (GeGLU variant) (Shazeer, 2020). We further apply query scaling by d−0.5 head and embedding scaling by √dhidden. Training was conducted on a single node with 8 Nvidia B200...
2024
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.