Beyond Perplexity: UTF-8 Validity in Byte-aware Language Models

Daisuke Oba; Naoaki Okazaki; Sangwhan Moon; Tatsuya Hiraoka; Youmi Ma

arxiv: 2606.14122 · v2 · pith:ARIOSVOZnew · submitted 2026-06-12 · 💻 cs.CL

Beyond Perplexity: UTF-8 Validity in Byte-aware Language Models

Sangwhan Moon , Daisuke Oba , Youmi Ma , Tatsuya Hiraoka , Naoaki Okazaki This is my paper

Pith reviewed 2026-06-27 05:13 UTC · model grok-4.3

classification 💻 cs.CL

keywords byte-level tokenizationUTF-8 validitylanguage model trainingmultilingual corpusgeneration evaluationperplexity

0 comments

The pith

Byte-level language models need twice as much training data for valid UTF-8 output as they do to minimize perplexity.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines how byte-aware language models can process any Unicode input yet still produce invalid UTF-8 sequences during generation. It introduces evaluation protocols that separate structural encoding validity from standard language modeling metrics. Training a 355M model on a balanced multilingual corpus shows that perplexity stabilizes after 2.1 billion tokens while UTF-8 validity continues improving until 4.2 billion tokens. The work also reports that context-free generation yields higher validity for rare characters than for common ones, indicating over-specialization on frequent representations. Reliable UTF-8 generation therefore emerges as a distinct capability that perplexity alone does not capture.

Core claim

UTF-8 validity convergence lags perplexity by a factor of two, with perplexity stabilizing after 2.1B tokens but UTF-8 validity requiring 4.2B tokens; in context-free generation, rare characters achieve higher structural validity than common characters, suggesting over-specialization of frequent character representations.

What carries the argument

Evaluation protocols that isolate UTF-8 structural validity from language modeling performance.

If this is right

Perplexity scores alone do not guarantee reliable UTF-8 generation in byte-level models.
Additional training beyond typical perplexity convergence points is required for encoding fidelity.
Rare characters receive less over-specialization and therefore produce more valid sequences than common characters.
Byte-level models require separate assessment of structural validity in addition to standard metrics.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Dedicated objectives that penalize invalid byte sequences could reduce the observed training lag.
The rare-versus-common difference may appear in other structural properties such as consistent punctuation or script switching.
Models trained on less balanced corpora could exhibit even larger gaps between perplexity and validity.

Load-bearing premise

The introduced evaluation protocols successfully isolate UTF-8 structural validity from ordinary language-modeling performance.

What would settle it

A replication experiment with altered model size, sampling procedure, or corpus balance that shows UTF-8 validity converging at the same token count as perplexity.

Figures

Figures reproduced from arXiv: 2606.14122 by Daisuke Oba, Naoaki Okazaki, Sangwhan Moon, Tatsuya Hiraoka, Youmi Ma.

**Figure 1.** Figure 1: An example of valid vs. invalid byte sequences. The invalid case cannot be decoded by a UTF-8 codec. have practical implications: models that appear well-trained by perplexity may still produce invalid UTF-8 sequences in context-sparse generation. 2. Problem Statement Modern large language models face a fundamental challenge in generating valid Unicode byte sequences, particularly when encountering rare … view at source ↗

**Figure 2.** Figure 2: Simplified UTF-8 DFA with dashed red error transitions [PITH_FULL_IMAGE:figures/full_fig_p003_2.png] view at source ↗

**Figure 3.** Figure 3: Per-tier plots for partial-credit validity. Common (blue), Uncommon (green), Rare (orange), Unseen (red). 5. Results Two evaluation protocols isolate UTF-8 generation capability from general language modeling. Level 0 tests completion of valid UTF-8 byte sequences without contextual cues; Level 1 tests retrieval of correct byte sequences when semantic context constrains the target character. Results span… view at source ↗

**Figure 4.** Figure 4: Side-by-side comparison of learning dynamics. The left column shows the baseline (L0) and the right column shows the context-guided setting (L1). Results of Chinese, Japanese, and Korean are plotted in green, red, and blue, respectively. Note how the partial credit validity (top row) stabilize significantly faster in the L1 setting. 48.2% validity, but produces 0% valid UTF-8 from the novel 11110xxx lead. … view at source ↗

**Figure 5.** Figure 5: Term match rate. Results of Chinese, Japanese, and Korean are plotted in green, red, and blue, respectively. with similar radical structure or meaning. Byte-level syntax and byte-level semantics are distinct capabilities, with the latter requiring more training. The diagnostic ∆LL (Eq. 5) splits the semantic failures further. Among Level 1 failures where ∆LL > 0 at the final checkpoint (51 cases), the mod… view at source ↗

**Figure 6.** Figure 6: Token distribution convergence during training data construction. The adaptive weight adjustment algorithm dynamically modifies language sampling probabilities to achieve the target distribution while preserving natural sentence boundaries. A. Ethical Considerations This research investigates potential vulnerabilities in language model tokenization that could be exploited for adversarial purposes. However,… view at source ↗

**Figure 7.** Figure 7: Side-by-side comparison of other validity metrics evaluated. The left column shows the baseline (L0) and the right column shows the context-guided setting (L1). Results of Chinese, Japanese, and Korean are plotted in green, red, and blue, respectively. Note how L0 in general is very unstable, while L1 is relatively more stable. H.1. Binary Score The binary score awards credit only for completely valid sequ… view at source ↗

read the original abstract

Byte-level tokenization enables language models to handle any Unicode input, but models can generate invalid UTF-8 sequences when encountering rare or unseen characters. We investigate the relationship between training scale and UTF-8 generation reliability with a 355M parameter model trained on 80B tokens from a balanced multilingual corpus of English, Japanese, Korean, and Chinese. We introduce multiple evaluation protocols that isolate UTF-8 structural validity from language modeling. UTF-8 validity convergence lags perplexity by a roughly a factor of two: perplexity stabilizes after 2.1B tokens, but UTF-8 validity requires 4.2B tokens. In context-free generation, rare characters achieve higher structural validity than common characters, suggesting over-specialization of frequent character representations. Through experiments, we observed that reliable UTF-8 generation is a distinct capability requiring evaluation beyond perplexity.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

UTF-8 validity in this byte-level model lags perplexity by a factor of two and favors rare characters, but the isolation from standard metrics needs more proof.

read the letter

The main point is that UTF-8 structural validity takes roughly twice as long to stabilize as perplexity in their 355M byte-level model, and context-free generation actually produces more valid sequences for rare characters than for common ones.

They train on a balanced multilingual corpus covering English, Japanese, Korean, and Chinese up to 80B tokens and track both metrics over training. Perplexity plateaus around 2.1B tokens while valid UTF-8 output requires about 4.2B. The rare-character result runs counter to the usual expectation that frequent items are handled better, and they read it as over-specialization on common patterns. The work introduces protocols meant to measure validity separately from ordinary language modeling performance.

This adds a concrete empirical distinction that prior byte-level papers have not quantified in the same way. It is useful to see that perplexity alone misses a generation reliability issue that matters for multilingual byte models.

The main weakness is the lack of detail on the protocols themselves. The abstract states the lag and the rare-common split but does not spell out exact scoring rules, sampling parameters, context lengths, or variance across runs. With results from only one model size and one corpus balance, it is unclear whether the factor-of-two gap is a general property or tied to this specific setup. The concern that the validity measure might still share variance with perplexity is reasonable until the methods are shown in full.

The paper is aimed at people working on byte-level tokenization or multilingual evaluation. Anyone running generation tests on similar models would want to know about the extra tokens needed for reliable output. It is worth sending to peer review because the central observation is direct and points to a practical evaluation gap, even if the manuscript needs added protocol descriptions and robustness checks to stand on its own.

Referee Report

2 major / 1 minor

Summary. The paper claims that byte-level LMs can produce invalid UTF-8 despite handling any Unicode input, and that with a 355M model trained on 80B tokens from a balanced multilingual corpus (English, Japanese, Korean, Chinese), UTF-8 validity convergence lags perplexity by a factor of roughly two (perplexity stabilizes after 2.1B tokens, UTF-8 validity after 4.2B). It further claims that in context-free generation rare characters achieve higher structural validity than common ones, suggesting over-specialization, and that reliable UTF-8 generation is a distinct capability requiring evaluation protocols beyond perplexity.

Significance. If the evaluation protocols truly isolate UTF-8 structural validity from standard LM performance, the result would show that perplexity alone is insufficient to guarantee reliable generation in byte-aware multilingual models and that training scale requirements differ for structural validity. The empirical observation of a lag and the rare-vs-common reversal would be a useful data point for byte-level modeling, though the single 355M model size and 80B-token setup limit immediate generalizability.

major comments (2)

[Abstract / protocol description] Abstract and the paragraphs describing the protocols: the central claims (factor-of-two lag between 2.1B and 4.2B tokens; rare > common validity) rest on the assertion that the introduced protocols isolate UTF-8 structural validity from ordinary language-modeling performance, yet no definitions are supplied for validity scoring, convergence criteria, statistical tests, run-to-run variance, or data-exclusion rules. Without these, the lag and the over-specialization interpretation cannot be verified and could be artifacts of the particular corpus balance or sampling procedure.
[Context-free generation experiments] Results on context-free generation: the claim that rare characters achieve higher structural validity than common characters is presented without reporting the prompting method, temperature, context length, or how 'rare' vs 'common' characters were sampled and counted. This leaves open whether the observed difference is driven by the generation procedure rather than a general property of byte-level representations.

minor comments (1)

[Abstract] The abstract states numerical thresholds (2.1B, 4.2B) without indicating how stabilization was operationalized; a brief parenthetical definition would improve clarity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments on our manuscript. We agree that the evaluation protocols require more explicit definitions and that the context-free generation experiments need additional methodological details to ensure reproducibility. We will incorporate these changes in a major revision.

read point-by-point responses

Referee: [Abstract / protocol description] Abstract and the paragraphs describing the protocols: the central claims (factor-of-two lag between 2.1B and 4.2B tokens; rare > common validity) rest on the assertion that the introduced protocols isolate UTF-8 structural validity from ordinary language-modeling performance, yet no definitions are supplied for validity scoring, convergence criteria, statistical tests, run-to-run variance, or data-exclusion rules. Without these, the lag and the over-specialization interpretation cannot be verified and could be artifacts of the particular corpus balance or sampling procedure.

Authors: We agree that the manuscript lacks sufficient detail on these aspects. In the revised version, we will add a dedicated 'Evaluation Protocols' subsection that defines: UTF-8 validity scoring as the fraction of byte sequences that successfully decode to valid UTF-8 (using Python's decode with errors='strict'); convergence as the training step where a 100M-token moving average changes by less than 0.5% over three consecutive checkpoints; statistical significance via paired t-tests (p<0.05) across three independent runs; run-to-run variance reported as standard deviation; and data-exclusion rules (sequences with >5% control bytes are excluded from validity computation). These additions will allow independent verification of the reported lag and rare-vs-common reversal. revision: yes
Referee: [Context-free generation experiments] Results on context-free generation: the claim that rare characters achieve higher structural validity than common characters is presented without reporting the prompting method, temperature, context length, or how 'rare' vs 'common' characters were sampled and counted. This leaves open whether the observed difference is driven by the generation procedure rather than a general property of byte-level representations.

Authors: We acknowledge the omission of these details. The revision will expand the relevant section to specify: context-free generation uses an empty prompt (no prefix tokens); temperature=1.0 with top-p=0.95; generation starts from a 0-token context; 'rare' characters are defined as those with frequency <100 in the 80B-token corpus while 'common' are the top 500 by frequency; 5,000 samples per category are drawn uniformly and validity is measured on the first 128 bytes generated. We will also note that the rare>common pattern holds under alternative sampling (e.g., frequency-weighted) and report exact counts. revision: yes

Circularity Check

0 steps flagged

No significant circularity; claims rest on direct empirical counts.

full rationale

The paper reports direct counts of valid UTF-8 byte sequences during training checkpoints and context-free generation. These measurements do not reduce to any fitted parameter, self-defined quantity, or self-citation chain. The introduced protocols are presented as independent evaluation methods that separate structural validity from perplexity; no equation or definition in the provided text shows the validity metric being constructed from the perplexity values or vice versa. The factor-of-two lag and rare-vs-common observations are therefore empirical findings rather than tautological restatements of inputs.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axioms · 0 invented entities

The paper is an empirical training study; it introduces no new theoretical entities or derivations. The only background assumptions are the standard definition of UTF-8 validity and the usual language-model training setup.

free parameters (2)

model_size
355M parameters chosen for the experiment; not fitted to the validity result.
training_scale
80B tokens and the 2.1B / 4.2B checkpoints are experimental design choices.

axioms (1)

standard math UTF-8 is a well-defined variable-length byte encoding whose structural validity can be checked independently of semantic content.
Invoked when defining the evaluation protocols that isolate structural validity.

pith-pipeline@v0.9.1-grok · 5689 in / 1422 out tokens · 26686 ms · 2026-06-27T05:13:20.434304+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

23 extracted references · 13 canonical work pages · 4 internal anchors

[1]

org/abs/2305.13245

URL http://arxiv. org/abs/2305.13245. arXiv:2305.13245 [cs]. Cognetta, M. and Okazaki, N. Tokenization as finite-state transduction.Computational Linguistics, 51(4):1119– 1149, December

Pith/arXiv arXiv
[2]

URL https://aclanthology.org/2025.cl-4.2/

doi: 10.1162/coli.a.23. URL https://aclanthology.org/2025.cl-4.2/. Geh, R. L., Shao, Z., and Broeck, G. V . d. Adversarial Tokenization, June

work page doi:10.1162/coli.a.23 2025
[3]

arXiv:2503.02174 [cs]

URLhttp://arxiv.org/abs/ 2503.02174. arXiv:2503.02174 [cs]. Gillick, D., Brunk, C., Vinyals, O., and Subramanya, A. Multilingual language processing from bytes. In Knight, K., Nenkova, A., and Rambow, O. (eds.),Proceedings of the 2016 Conference of the North American Chap- ter of the Association for Computational Linguistics: Human Language Technologies, ...

arXiv 2016
[4]

doi: 10.18653/v1/N16-1155

Association for Compu- tational Linguistics. doi: 10.18653/v1/N16-1155. URL https://aclanthology.org/N16-1155/. Hoffmann, J., Borgeaud, S., Mensch, A., Buchatskaya, E., Cai, T., Rutherford, E., Casas, D. d. L., Hendricks, L. A., Welbl, J., Clark, A., Hennigan, T., Noland, E., Millican, K., Driessche, G. v. d., Damoc, B., Guy, A., Osindero, S., Simonyan, K...

work page doi:10.18653/v1/n16-1155
[5]

arXiv:2203.15556 [cs]

URL http://arxiv.org/abs/ 2203.15556. arXiv:2203.15556 [cs]. Kaplan, J., McCandlish, S., Henighan, T., Brown, T. B., Chess, B., Child, R., Gray, S., Radford, A., Wu, J., and Amodei, D. Scaling Laws for Neural Language Mod- els, January

Pith/arXiv arXiv
[6]

URL http://arxiv.org/abs/2001. 08361. arXiv:2001.08361 [cs]. Koo, T., Liu, F., and He, L. Automata-based constraints for language model decoding. InConference on Language Modeling (COLM),

Pith/arXiv arXiv 2001
[7]

net/forum?id=BDBdblmyzY

URL https://openreview. net/forum?id=BDBdblmyzY. arXiv:2407.08103. Land, S. and Arnett, C. BPE stays on SCRIPT: Struc- tured encoding for robust multilingual pretokeniza- tion. InICML 2025 Workshop on Tokenization (Tok- Shop),

arXiv 2025
[8]

URL https://openreview.net/forum? id=AO78CqwaUO. Land, S. and Bartolo, M. Fishing for magikarp: Auto- matically detecting under-trained tokens in large lan- guage models. In Al-Onaizan, Y ., Bansal, M., and Chen, Y .-N. (eds.),Proceedings of the 2024 Conference on Empirical Methods in Natural Language Process- ing, pp. 11631–11646, Miami, Florida, USA, Novem- ber

2024
[9]

Fishing for Magikarp: Automatically Detecting Under-trained Tokens in Large Language Models

Association for Computational Linguistics. doi: 10.18653/v1/2024.emnlp-main.649. URL https: //aclanthology.org/2024.emnlp-main.649/. Li, Y ., Liu, Y ., Deng, G., Zhang, Y ., Song, W., Shi, L., Wang, K., Li, Y ., Liu, Y ., and Wang, H. Glitch Tokens in Large Language Models: Categorization Taxonomy and Effective Detection, April

work page doi:10.18653/v1/2024.emnlp-main.649 2024
[10]

org/abs/2404.09894

URL http://arxiv. org/abs/2404.09894. arXiv:2404.09894 [cs]. Limisiewicz, T., Blevins, T., Gonen, H., Ahia, O., and Zettlemoyer, L. MYTE: Morphology-driven byte en- coding for better and fairer multilingual language mod- eling. In Ku, L.-W., Martins, A., and Srikumar, V . (eds.),Proceedings of the 62nd Annual Meeting of the Association for Computational L...

arXiv
[11]

doi: 10.18653/v1/2024.acl-long.804

Association for Computational Linguis- tics. doi: 10.18653/v1/2024.acl-long.804. URL https: //aclanthology.org/2024.acl-long.804/. Pagnoni, A., Pasunuru, R., Rodriguez, P., Nguyen, J., Muller, B., Li, M., Zhou, C., Yu, L., Weston, J., Zettle- moyer, L., Ghosh, G., Lewis, M., Holtzman, A., and Iyer, S. Byte Latent Transformer: Patches Scale Better Than Tok...

work page doi:10.18653/v1/2024.acl-long.804 2024
[12]

Pagnoni, R

URLhttp://arxiv.org/abs/ 2412.09871. arXiv:2412.09871 [cs]. Penedo, G., Kydl ´ıˇcek, H., allal, L. B., Lozhkov, A., Mitchell, M., Raffel, C., V on Werra, L., and Wolf, T. The fineweb datasets: Decanting the web for the finest text data at scale. In Globerson, A., Mackey, L., Belgrave, D., Fan, A., Paquet, U., Tom- czak, J., and Zhang, C. (eds.),Advances i...

work page arXiv
[13]

2024 , booktitle =

doi: 10.52202/079017-0970. URL https://proceedings. neurips.cc/paper files/paper/2024/file/ 370df50ccfdf8bde18f8f9c2d9151bda-Paper-Datasets and Benchmarks Track.pdf. 10 Beyond Perplexity: UTF-8 Validity in Byte-aware Language Models Pokharel, R., Nezhad, S. B., Agrawal, A., and Singh, S. The Impact of Model Scaling on Seen and Unseen Language Performance, January

work page doi:10.52202/079017-0970 2024
[14]

arXiv:2501.05629 [cs]

URL http://arxiv.org/ abs/2501.05629. arXiv:2501.05629 [cs]. Rust, P., Pfeiffer, J., Vuli ´c, I., Ruder, S., and Gurevych, I. How good is your tokenizer? on the monolin- gual performance of multilingual language models. In Zong, C., Xia, F., Li, W., and Navigli, R. (eds.),Pro- ceedings of the 59th Annual Meeting of the Associa- tion for Computational Ling...

arXiv
[15]

12 Oleksiy Syvokon and Mariana Romanyshyn

Association for Computational Lin- guistics. doi: 10.18653/v1/2021.acl-long.243. URL https://aclanthology.org/2021.acl-long.243/. Shazeer, N. GLU Variants Improve Transformer, Febru- ary

work page doi:10.18653/v1/2021.acl-long.243 2021
[16]

GLU Variants Improve Transformer

URL http://arxiv.org/abs/2002.05202. arXiv:2002.05202 [cs]. Singh, S., Vargus, F., D’souza, D., Karlsson, B., Mahendi- ran, A., Ko, W.-Y ., Shandilya, H., Patel, J., Mataciunas, D., O’Mahony, L., Zhang, M., Hettiarachchi, R., Wil- son, J., Machado, M., Moura, L., Krzemi´nski, D., Fadaei, H., Ergun, I., Okoh, I., Alaagib, A., Mudannayake, O., Alyafeai, Z.,...

work page internal anchor Pith review Pith/arXiv arXiv 2002
[17]

doi: 10.18653/v1/2024.acl-long.620

Association for Computational Linguistics. doi: 10.18653/v1/2024.acl-long.620. URL https://aclanthology.org/2024.acl-long.620/. Su, J., Ahmed, M., Lu, Y ., Pan, S., Bo, W., and Liu, Y . Roformer: Enhanced transformer with rotary position embedding.Neurocomputing, 568:127063,

work page doi:10.18653/v1/2024.acl-long.620 2024
[18]

RoFormer: Enhanced transformer with Rotary Position Embedding , journal =

doi: 10.1016/j.neucom.2023.127063. URL https://arxiv. org/abs/2104.09864. Team, G., Riviere, M., Pathak, S., Sessa, P. G., Hardin, C., Bhupatiraju, S., Hussenot, L., Mesnard, T., Shahriari, B., Ram´e, A., Ferret, J., Liu, P., Tafti, P., Friesen, A., Casbon, M., Ramos, S., Kumar, R., Lan, C. L., Jerome, S., Tsit- sulin, A., Vieillard, N., Stanczyk, P., Gir...

work page doi:10.1016/j.neucom.2023.127063 2023
[19]

Gemma 2: Improving Open Language Models at a Practical Size

URL http://arxiv.org/abs/2408.00118. arXiv:2408.00118 [cs]. Wang, C., Cho, K., and Gu, J. Neural machine translation with byte-level subwords. InProceedings of the AAAI conference on artificial intelligence, volume 34, pp. 9154– 9160,

work page internal anchor Pith review Pith/arXiv arXiv
[20]

Willard, B. T. and Louf, R. Efficient guided generation for large language models.arXiv preprint arXiv:2307.09702,

work page internal anchor Pith review Pith/arXiv arXiv
[21]

Thomas McCoy, Paul Smolensky, Tal Linzen, Jianfeng Gao, and Asli Celikyilmaz

doi: 10.1162/tacl a 00461. URLhttps://aclanthology.org/2022.tacl-1.17/. Zhang, B. and Sennrich, R. Root Mean Square Layer Nor- malization, October

work page internal anchor Pith review doi:10.1162/tacl 2022
[22]

arXiv:1910.07467 [cs]

URL http://arxiv.org/ abs/1910.07467. arXiv:1910.07467 [cs]. 11 Beyond Perplexity: UTF-8 Validity in Byte-aware Language Models 0 1000 2000 3000 4000 5000 6000 7000 8000 9.8 9.9 10.0 10.1 10.2T oken Percentage English 0 1000 2000 3000 4000 5000 6000 7000 8000 29.8 30.0 30.2 Japanese 0 1000 2000 3000 4000 5000 6000 7000 8000 Data Batch 29.8 30.0 30.2T oken...

Pith/arXiv arXiv 1910
[23]

The feed-forward blocks use a gated MLP (GeGLU variant) (Shazeer, 2020)

with a reduced number of key/value heads to lower memory usage (similar in spirit to recent implementations such as Gemma 2 (Team et al., 2024)). The feed-forward blocks use a gated MLP (GeGLU variant) (Shazeer, 2020). We further apply query scaling by d−0.5 head and embedding scaling by √dhidden. Training was conducted on a single node with 8 Nvidia B200...

2024

[1] [1]

org/abs/2305.13245

URL http://arxiv. org/abs/2305.13245. arXiv:2305.13245 [cs]. Cognetta, M. and Okazaki, N. Tokenization as finite-state transduction.Computational Linguistics, 51(4):1119– 1149, December

Pith/arXiv arXiv

[2] [2]

URL https://aclanthology.org/2025.cl-4.2/

doi: 10.1162/coli.a.23. URL https://aclanthology.org/2025.cl-4.2/. Geh, R. L., Shao, Z., and Broeck, G. V . d. Adversarial Tokenization, June

work page doi:10.1162/coli.a.23 2025

[3] [3]

arXiv:2503.02174 [cs]

URLhttp://arxiv.org/abs/ 2503.02174. arXiv:2503.02174 [cs]. Gillick, D., Brunk, C., Vinyals, O., and Subramanya, A. Multilingual language processing from bytes. In Knight, K., Nenkova, A., and Rambow, O. (eds.),Proceedings of the 2016 Conference of the North American Chap- ter of the Association for Computational Linguistics: Human Language Technologies, ...

arXiv 2016

[4] [4]

doi: 10.18653/v1/N16-1155

Association for Compu- tational Linguistics. doi: 10.18653/v1/N16-1155. URL https://aclanthology.org/N16-1155/. Hoffmann, J., Borgeaud, S., Mensch, A., Buchatskaya, E., Cai, T., Rutherford, E., Casas, D. d. L., Hendricks, L. A., Welbl, J., Clark, A., Hennigan, T., Noland, E., Millican, K., Driessche, G. v. d., Damoc, B., Guy, A., Osindero, S., Simonyan, K...

work page doi:10.18653/v1/n16-1155

[5] [5]

arXiv:2203.15556 [cs]

URL http://arxiv.org/abs/ 2203.15556. arXiv:2203.15556 [cs]. Kaplan, J., McCandlish, S., Henighan, T., Brown, T. B., Chess, B., Child, R., Gray, S., Radford, A., Wu, J., and Amodei, D. Scaling Laws for Neural Language Mod- els, January

Pith/arXiv arXiv

[6] [6]

URL http://arxiv.org/abs/2001. 08361. arXiv:2001.08361 [cs]. Koo, T., Liu, F., and He, L. Automata-based constraints for language model decoding. InConference on Language Modeling (COLM),

Pith/arXiv arXiv 2001

[7] [7]

net/forum?id=BDBdblmyzY

URL https://openreview. net/forum?id=BDBdblmyzY. arXiv:2407.08103. Land, S. and Arnett, C. BPE stays on SCRIPT: Struc- tured encoding for robust multilingual pretokeniza- tion. InICML 2025 Workshop on Tokenization (Tok- Shop),

arXiv 2025

[8] [8]

URL https://openreview.net/forum? id=AO78CqwaUO. Land, S. and Bartolo, M. Fishing for magikarp: Auto- matically detecting under-trained tokens in large lan- guage models. In Al-Onaizan, Y ., Bansal, M., and Chen, Y .-N. (eds.),Proceedings of the 2024 Conference on Empirical Methods in Natural Language Process- ing, pp. 11631–11646, Miami, Florida, USA, Novem- ber

2024

[9] [9]

Fishing for Magikarp: Automatically Detecting Under-trained Tokens in Large Language Models

Association for Computational Linguistics. doi: 10.18653/v1/2024.emnlp-main.649. URL https: //aclanthology.org/2024.emnlp-main.649/. Li, Y ., Liu, Y ., Deng, G., Zhang, Y ., Song, W., Shi, L., Wang, K., Li, Y ., Liu, Y ., and Wang, H. Glitch Tokens in Large Language Models: Categorization Taxonomy and Effective Detection, April

work page doi:10.18653/v1/2024.emnlp-main.649 2024

[10] [10]

org/abs/2404.09894

URL http://arxiv. org/abs/2404.09894. arXiv:2404.09894 [cs]. Limisiewicz, T., Blevins, T., Gonen, H., Ahia, O., and Zettlemoyer, L. MYTE: Morphology-driven byte en- coding for better and fairer multilingual language mod- eling. In Ku, L.-W., Martins, A., and Srikumar, V . (eds.),Proceedings of the 62nd Annual Meeting of the Association for Computational L...

arXiv

[11] [11]

doi: 10.18653/v1/2024.acl-long.804

Association for Computational Linguis- tics. doi: 10.18653/v1/2024.acl-long.804. URL https: //aclanthology.org/2024.acl-long.804/. Pagnoni, A., Pasunuru, R., Rodriguez, P., Nguyen, J., Muller, B., Li, M., Zhou, C., Yu, L., Weston, J., Zettle- moyer, L., Ghosh, G., Lewis, M., Holtzman, A., and Iyer, S. Byte Latent Transformer: Patches Scale Better Than Tok...

work page doi:10.18653/v1/2024.acl-long.804 2024

[12] [12]

Pagnoni, R

URLhttp://arxiv.org/abs/ 2412.09871. arXiv:2412.09871 [cs]. Penedo, G., Kydl ´ıˇcek, H., allal, L. B., Lozhkov, A., Mitchell, M., Raffel, C., V on Werra, L., and Wolf, T. The fineweb datasets: Decanting the web for the finest text data at scale. In Globerson, A., Mackey, L., Belgrave, D., Fan, A., Paquet, U., Tom- czak, J., and Zhang, C. (eds.),Advances i...

work page arXiv

[13] [13]

2024 , booktitle =

doi: 10.52202/079017-0970. URL https://proceedings. neurips.cc/paper files/paper/2024/file/ 370df50ccfdf8bde18f8f9c2d9151bda-Paper-Datasets and Benchmarks Track.pdf. 10 Beyond Perplexity: UTF-8 Validity in Byte-aware Language Models Pokharel, R., Nezhad, S. B., Agrawal, A., and Singh, S. The Impact of Model Scaling on Seen and Unseen Language Performance, January

work page doi:10.52202/079017-0970 2024

[14] [14]

arXiv:2501.05629 [cs]

URL http://arxiv.org/ abs/2501.05629. arXiv:2501.05629 [cs]. Rust, P., Pfeiffer, J., Vuli ´c, I., Ruder, S., and Gurevych, I. How good is your tokenizer? on the monolin- gual performance of multilingual language models. In Zong, C., Xia, F., Li, W., and Navigli, R. (eds.),Pro- ceedings of the 59th Annual Meeting of the Associa- tion for Computational Ling...

arXiv

[15] [15]

12 Oleksiy Syvokon and Mariana Romanyshyn

Association for Computational Lin- guistics. doi: 10.18653/v1/2021.acl-long.243. URL https://aclanthology.org/2021.acl-long.243/. Shazeer, N. GLU Variants Improve Transformer, Febru- ary

work page doi:10.18653/v1/2021.acl-long.243 2021

[16] [16]

GLU Variants Improve Transformer

URL http://arxiv.org/abs/2002.05202. arXiv:2002.05202 [cs]. Singh, S., Vargus, F., D’souza, D., Karlsson, B., Mahendi- ran, A., Ko, W.-Y ., Shandilya, H., Patel, J., Mataciunas, D., O’Mahony, L., Zhang, M., Hettiarachchi, R., Wil- son, J., Machado, M., Moura, L., Krzemi´nski, D., Fadaei, H., Ergun, I., Okoh, I., Alaagib, A., Mudannayake, O., Alyafeai, Z.,...

work page internal anchor Pith review Pith/arXiv arXiv 2002

[17] [17]

doi: 10.18653/v1/2024.acl-long.620

Association for Computational Linguistics. doi: 10.18653/v1/2024.acl-long.620. URL https://aclanthology.org/2024.acl-long.620/. Su, J., Ahmed, M., Lu, Y ., Pan, S., Bo, W., and Liu, Y . Roformer: Enhanced transformer with rotary position embedding.Neurocomputing, 568:127063,

work page doi:10.18653/v1/2024.acl-long.620 2024

[18] [18]

RoFormer: Enhanced transformer with Rotary Position Embedding , journal =

doi: 10.1016/j.neucom.2023.127063. URL https://arxiv. org/abs/2104.09864. Team, G., Riviere, M., Pathak, S., Sessa, P. G., Hardin, C., Bhupatiraju, S., Hussenot, L., Mesnard, T., Shahriari, B., Ram´e, A., Ferret, J., Liu, P., Tafti, P., Friesen, A., Casbon, M., Ramos, S., Kumar, R., Lan, C. L., Jerome, S., Tsit- sulin, A., Vieillard, N., Stanczyk, P., Gir...

work page doi:10.1016/j.neucom.2023.127063 2023

[19] [19]

Gemma 2: Improving Open Language Models at a Practical Size

URL http://arxiv.org/abs/2408.00118. arXiv:2408.00118 [cs]. Wang, C., Cho, K., and Gu, J. Neural machine translation with byte-level subwords. InProceedings of the AAAI conference on artificial intelligence, volume 34, pp. 9154– 9160,

work page internal anchor Pith review Pith/arXiv arXiv

[20] [20]

Willard, B. T. and Louf, R. Efficient guided generation for large language models.arXiv preprint arXiv:2307.09702,

work page internal anchor Pith review Pith/arXiv arXiv

[21] [21]

Thomas McCoy, Paul Smolensky, Tal Linzen, Jianfeng Gao, and Asli Celikyilmaz

doi: 10.1162/tacl a 00461. URLhttps://aclanthology.org/2022.tacl-1.17/. Zhang, B. and Sennrich, R. Root Mean Square Layer Nor- malization, October

work page internal anchor Pith review doi:10.1162/tacl 2022

[22] [22]

arXiv:1910.07467 [cs]

URL http://arxiv.org/ abs/1910.07467. arXiv:1910.07467 [cs]. 11 Beyond Perplexity: UTF-8 Validity in Byte-aware Language Models 0 1000 2000 3000 4000 5000 6000 7000 8000 9.8 9.9 10.0 10.1 10.2T oken Percentage English 0 1000 2000 3000 4000 5000 6000 7000 8000 29.8 30.0 30.2 Japanese 0 1000 2000 3000 4000 5000 6000 7000 8000 Data Batch 29.8 30.0 30.2T oken...

Pith/arXiv arXiv 1910

[23] [23]

The feed-forward blocks use a gated MLP (GeGLU variant) (Shazeer, 2020)

with a reduced number of key/value heads to lower memory usage (similar in spirit to recent implementations such as Gemma 2 (Team et al., 2024)). The feed-forward blocks use a gated MLP (GeGLU variant) (Shazeer, 2020). We further apply query scaling by d−0.5 head and embedding scaling by √dhidden. Training was conducted on a single node with 8 Nvidia B200...

2024