arxiv: 2606.14122 · v2 · pith:ARIOSVOZnew · submitted 2026-06-12 · 💻 cs.CL

Beyond Perplexity: UTF-8 Validity in Byte-aware Language Models

Pith reviewed 2026-06-27 05:13 UTC · model grok-4.3

classification 💻 cs.CL
keywords byte-level tokenization · UTF-8 validity · language model training · multilingual corpus · generation evaluation · perplexity

The pith

In the paper's 355M-parameter setup, byte-level language models needed roughly twice as much training data to produce reliably valid UTF-8 as to reach stable perplexity.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines how byte-aware language models can process any Unicode input yet still produce invalid UTF-8 sequences during generation. It introduces evaluation protocols that separate structural encoding validity from standard language modeling metrics. Training a 355M model on a balanced multilingual corpus shows that perplexity stabilizes after 2.1 billion tokens while UTF-8 validity continues improving until 4.2 billion tokens. The work also reports that context-free generation yields higher validity for rare characters than for common ones, indicating over-specialization on frequent representations. Reliable UTF-8 generation therefore emerges as a distinct capability that perplexity alone does not capture.

Core claim

UTF-8 validity convergence lags perplexity by a factor of two, with perplexity stabilizing after 2.1B tokens but UTF-8 validity requiring 4.2B tokens; in context-free generation, rare characters achieve higher structural validity than common characters, suggesting over-specialization of frequent character representations.

What carries the argument

Evaluation protocols that isolate UTF-8 structural validity from language modeling performance.
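
To make "structural validity" concrete, here is a minimal sketch of a validity check that ignores meaning entirely. It assumes the strict-decode criterion mentioned in the simulated rebuttal further down (Python's `decode` with `errors='strict'`); the helper names `is_valid_utf8` and `validity_rate` and the sample byte strings are hypothetical and are not taken from the paper.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if `data` decodes under Python's strict UTF-8 codec."""
    try:
        data.decode("utf-8", errors="strict")
        return True
    except UnicodeDecodeError:
        return False


def validity_rate(samples: list[bytes]) -> float:
    """Fraction of samples that are structurally valid UTF-8."""
    if not samples:
        return 0.0
    return sum(is_valid_utf8(s) for s in samples) / len(samples)


# Illustrative byte strings only; these are not model outputs from the paper.
samples = [
    "日本語".encode("utf-8"),  # valid: three 3-byte characters
    b"\xe6\x97",               # invalid: truncated 3-byte sequence
    b"\xf0\x9f\x98\x80",       # valid: one 4-byte sequence (U+1F600)
    b"\x80abc",                # invalid: stray continuation byte
]
print(validity_rate(samples))  # 0.5
```

A check like this says nothing about whether the decoded text is fluent or correct, which is the separation from perplexity that the paper's protocols aim for.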

If this is right

  • Perplexity scores alone do not guarantee reliable UTF-8 generation in byte-level models.
  • Additional training beyond typical perplexity convergence points is required for encoding fidelity.
  • In context-free generation, rare characters yield more valid sequences than common ones, which the paper attributes to over-specialization of frequent character representations.
  • Byte-level models require separate assessment of structural validity in addition to standard metrics.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Dedicated objectives that penalize invalid byte sequences could reduce the observed training lag.
  • The rare-versus-common difference may appear in other structural properties such as consistent punctuation or script switching.
  • Models trained on less balanced corpora could exhibit even larger gaps between perplexity and validity.

Load-bearing premise

The introduced evaluation protocols successfully isolate UTF-8 structural validity from ordinary language-modeling performance.

What would settle it

A replication experiment with altered model size, sampling procedure, or corpus balance that shows UTF-8 validity converging at the same token count as perplexity.

Figures

Figures reproduced from arXiv: 2606.14122 by Daisuke Oba, Naoaki Okazaki, Sangwhan Moon, Tatsuya Hiraoka, Youmi Ma.

Figure 1
Figure 1: An example of valid vs. invalid byte sequences. The invalid case cannot be decoded by a UTF-8 codec.
… have practical implications: models that appear well-trained by perplexity may still produce invalid UTF-8 sequences in context-sparse generation. 2. Problem Statement. Modern large language models face a fundamental challenge in generating valid Unicode byte sequences, particularly when encountering rare …
Figure 2
Figure 2: Simplified UTF-8 DFA with dashed red error transitions.
Figure 3
Figure 3: Per-tier plots for partial-credit validity. Common (blue), Uncommon (green), Rare (orange), Unseen (red).
5. Results. Two evaluation protocols isolate UTF-8 generation capability from general language modeling. Level 0 tests completion of valid UTF-8 byte sequences without contextual cues; Level 1 tests retrieval of correct byte sequences when semantic context constrains the target character. Results span…
Figure 4
Figure 4: Side-by-side comparison of learning dynamics. The left column shows the baseline (L0) and the right column shows the context-guided setting (L1). Results for Chinese, Japanese, and Korean are plotted in green, red, and blue, respectively. Note how the partial-credit validity (top row) stabilizes significantly faster in the L1 setting.
… 48.2% validity, but produces 0% valid UTF-8 from the novel 11110xxx lead. …
Figure 5
Figure 5: Term match rate. Results for Chinese, Japanese, and Korean are plotted in green, red, and blue, respectively.
… with similar radical structure or meaning. Byte-level syntax and byte-level semantics are distinct capabilities, with the latter requiring more training. The diagnostic ∆LL (Eq. 5) splits the semantic failures further. Among Level 1 failures where ∆LL > 0 at the final checkpoint (51 cases), the mod…
Figure 6
Figure 6: Token distribution convergence during training data construction. The adaptive weight adjustment algorithm dynamically modifies language sampling probabilities to achieve the target distribution while preserving natural sentence boundaries.
A. Ethical Considerations. This research investigates potential vulnerabilities in language model tokenization that could be exploited for adversarial purposes. However,…
Figure 7
Figure 7: Side-by-side comparison of other validity metrics evaluated. The left column shows the baseline (L0) and the right column shows the context-guided setting (L1). Results for Chinese, Japanese, and Korean are plotted in green, red, and blue, respectively. Note how L0 in general is very unstable, while L1 is relatively more stable.
H.1. Binary Score. The binary score awards credit only for completely valid sequ…
Original abstract

Byte-level tokenization enables language models to handle any Unicode input, but models can generate invalid UTF-8 sequences when encountering rare or unseen characters. We investigate the relationship between training scale and UTF-8 generation reliability with a 355M parameter model trained on 80B tokens from a balanced multilingual corpus of English, Japanese, Korean, and Chinese. We introduce multiple evaluation protocols that isolate UTF-8 structural validity from language modeling. UTF-8 validity convergence lags perplexity by roughly a factor of two: perplexity stabilizes after 2.1B tokens, but UTF-8 validity requires 4.2B tokens. In context-free generation, rare characters achieve higher structural validity than common characters, suggesting over-specialization of frequent character representations. Through experiments, we observed that reliable UTF-8 generation is a distinct capability requiring evaluation beyond perplexity.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper claims that byte-level LMs can produce invalid UTF-8 despite handling any Unicode input, and that with a 355M model trained on 80B tokens from a balanced multilingual corpus (English, Japanese, Korean, Chinese), UTF-8 validity convergence lags perplexity by a factor of roughly two (perplexity stabilizes after 2.1B tokens, UTF-8 validity after 4.2B). It further claims that in context-free generation rare characters achieve higher structural validity than common ones, suggesting over-specialization, and that reliable UTF-8 generation is a distinct capability requiring evaluation protocols beyond perplexity.

Significance. If the evaluation protocols truly isolate UTF-8 structural validity from standard LM performance, the result would show that perplexity alone is insufficient to guarantee reliable generation in byte-aware multilingual models and that training scale requirements differ for structural validity. The empirical observation of a lag and the rare-vs-common reversal would be a useful data point for byte-level modeling, though the single 355M model size and 80B-token setup limit immediate generalizability.

major comments (2)
  1. [Abstract / protocol description] Abstract and the paragraphs describing the protocols: the central claims (factor-of-two lag between 2.1B and 4.2B tokens; rare > common validity) rest on the assertion that the introduced protocols isolate UTF-8 structural validity from ordinary language-modeling performance, yet no definitions are supplied for validity scoring, convergence criteria, statistical tests, run-to-run variance, or data-exclusion rules. Without these, the lag and the over-specialization interpretation cannot be verified and could be artifacts of the particular corpus balance or sampling procedure.
  2. [Context-free generation experiments] Results on context-free generation: the claim that rare characters achieve higher structural validity than common characters is presented without reporting the prompting method, temperature, context length, or how 'rare' vs 'common' characters were sampled and counted. This leaves open whether the observed difference is driven by the generation procedure rather than a general property of byte-level representations.
minor comments (1)
  1. [Abstract] The abstract states numerical thresholds (2.1B, 4.2B) without indicating how stabilization was operationalized; a brief parenthetical definition would improve clarity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments on our manuscript. We agree that the evaluation protocols require more explicit definitions and that the context-free generation experiments need additional methodological details to ensure reproducibility. We will incorporate these changes in a major revision.

Point-by-point responses
  1. Referee: [Abstract / protocol description] Abstract and the paragraphs describing the protocols: the central claims (factor-of-two lag between 2.1B and 4.2B tokens; rare > common validity) rest on the assertion that the introduced protocols isolate UTF-8 structural validity from ordinary language-modeling performance, yet no definitions are supplied for validity scoring, convergence criteria, statistical tests, run-to-run variance, or data-exclusion rules. Without these, the lag and the over-specialization interpretation cannot be verified and could be artifacts of the particular corpus balance or sampling procedure.

    Authors: We agree that the manuscript lacks sufficient detail on these aspects. In the revised version, we will add a dedicated 'Evaluation Protocols' subsection that defines: UTF-8 validity scoring as the fraction of byte sequences that successfully decode to valid UTF-8 (using Python's decode with errors='strict'); convergence as the training step where a 100M-token moving average changes by less than 0.5% over three consecutive checkpoints; statistical significance via paired t-tests (p<0.05) across three independent runs; run-to-run variance reported as standard deviation; and data-exclusion rules (sequences with >5% control bytes are excluded from validity computation). These additions will allow independent verification of the reported lag and rare-vs-common reversal. revision: yes

  2. Referee: [Context-free generation experiments] Results on context-free generation: the claim that rare characters achieve higher structural validity than common characters is presented without reporting the prompting method, temperature, context length, or how 'rare' vs 'common' characters were sampled and counted. This leaves open whether the observed difference is driven by the generation procedure rather than a general property of byte-level representations.

    Authors: We acknowledge the omission of these details. The revision will expand the relevant section to specify: context-free generation uses an empty prompt (no prefix tokens); temperature=1.0 with top-p=0.95; generation starts from a 0-token context; 'rare' characters are defined as those with frequency <100 in the 80B-token corpus while 'common' are the top 500 by frequency; 5,000 samples per category are drawn uniformly and validity is measured on the first 128 bytes generated. We will also note that the rare>common pattern holds under alternative sampling (e.g., frequency-weighted) and report exact counts. revision: yes
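
The convergence rule proposed in the first response (a smoothed metric changing by less than 0.5% over three consecutive checkpoints) can be sketched as follows. This is one possible reading of that sentence, not the authors' code: it assumes the input values are already smoothed (for example by the 100M-token moving average), and it reports the checkpoint at which the third consecutive small change is observed. The function name `convergence_index` and the toy series are hypothetical.

```python
from typing import Optional, Sequence


def convergence_index(
    values: Sequence[float], rel_tol: float = 0.005, patience: int = 3
) -> Optional[int]:
    """Index of the first checkpoint that completes `patience` consecutive
    checkpoint-to-checkpoint relative changes below `rel_tol`, or None."""
    run = 0
    for i in range(1, len(values)):
        prev, cur = values[i - 1], values[i]
        # Treat a zero baseline as "not yet converged" to avoid dividing by zero.
        change = abs(cur - prev) / abs(prev) if prev != 0 else float("inf")
        run = run + 1 if change < rel_tol else 0
        if run >= patience:
            return i
    return None


# Toy smoothed validity curve; illustrative numbers, not results from the paper.
toy_validity = [0.50, 0.70, 0.85, 0.90, 0.902, 0.903, 0.9035, 0.9036]
print(convergence_index(toy_validity))  # 6

# The lag quoted in the abstract is the ratio of the two reported token counts.
print(4.2e9 / 2.1e9)  # 2.0
```

Applying the same function to the perplexity curve and to the validity curve, then comparing the token counts at the returned checkpoints, gives a lag estimate that depends on `rel_tol` and `patience`, which is why the referee asks for these choices to be stated.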

Circularity Check

0 steps flagged

No significant circularity; claims rest on direct empirical counts.

Full rationale

The paper reports direct counts of valid UTF-8 byte sequences during training checkpoints and context-free generation. These measurements do not reduce to any fitted parameter, self-defined quantity, or self-citation chain. The introduced protocols are presented as independent evaluation methods that separate structural validity from perplexity; no equation or definition in the provided text shows the validity metric being constructed from the perplexity values or vice versa. The factor-of-two lag and rare-vs-common observations are therefore empirical findings rather than tautological restatements of inputs.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axiom · 0 invented entities

The paper is an empirical training study; it introduces no new theoretical entities or derivations. The only background assumptions are the standard definition of UTF-8 validity and the usual language-model training setup.

free parameters (2)
  • model_size
    355M parameters chosen for the experiment; not fitted to the validity result.
  • training_scale
    80B tokens and the 2.1B / 4.2B checkpoints are experimental design choices.
axioms (1)
  • [standard math] UTF-8 is a well-defined variable-length byte encoding whose structural validity can be checked independently of semantic content.
    Invoked when defining the evaluation protocols that isolate structural validity.
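
The axiom above says validity can be decided from the bytes alone. A minimal sketch of such a check as a byte-level state machine follows, using the well-formed byte ranges defined by the Unicode Standard and RFC 3629. It is an illustration only and is not the paper's DFA from Figure 2; the names `START`, `step`, and `is_valid` are hypothetical.

```python
# State: (continuation bytes still expected, lowest and highest allowed next byte).
START = (0, 0x80, 0xBF)


def step(state, b):
    """Advance by one byte; return the next state, or None on an error transition."""
    remaining, lo, hi = state
    if remaining == 0:
        if b <= 0x7F:
            return START              # ASCII, single byte
        if 0xC2 <= b <= 0xDF:
            return (1, 0x80, 0xBF)    # 2-byte lead (0xC0, 0xC1 would be overlong)
        if b == 0xE0:
            return (2, 0xA0, 0xBF)    # 3-byte lead, excludes overlong forms
        if b == 0xED:
            return (2, 0x80, 0x9F)    # 3-byte lead, excludes UTF-16 surrogates
        if 0xE1 <= b <= 0xEF:
            return (2, 0x80, 0xBF)    # remaining 3-byte leads
        if b == 0xF0:
            return (3, 0x90, 0xBF)    # 4-byte lead, excludes overlong forms
        if b == 0xF4:
            return (3, 0x80, 0x8F)    # 4-byte lead, caps at U+10FFFF
        if 0xF1 <= b <= 0xF3:
            return (3, 0x80, 0xBF)    # remaining 4-byte leads
        return None                   # 0x80..0xC1 and 0xF5..0xFF cannot start a character
    if lo <= b <= hi:
        # After the first continuation byte, any 0x80..0xBF byte is allowed.
        return (remaining - 1, 0x80, 0xBF)
    return None


def is_valid(data: bytes) -> bool:
    """True if `data` is a sequence of complete, well-formed UTF-8 characters."""
    state = START
    for b in data:
        state = step(state, b)
        if state is None:
            return False
    return state == START             # must not end mid-character


print(is_valid("한국어".encode("utf-8")))  # True
print(is_valid(b"\xc0\x80"))               # False (overlong encoding)
```

Because each step depends only on the current state and the next byte, the same table could in principle be used to score partial outputs or to mask invalid next bytes during decoding, though the paper's own scoring rules are not specified in the text available here.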

pith-pipeline@v0.9.1-grok · 5689 in / 1422 out tokens · 26686 ms · 2026-06-27T05:13:20.434304+00:00 · methodology

