BrahmicTokenizer-131K: An Indic-Capable Drop-In Replacement for o200k_base

Rohan Shravan

arxiv: 2605.29379 · v1 · pith:EHMTTSDRnew · submitted 2026-05-28 · 💻 cs.CL · cs.LG

Pith reviewed 2026-06-29 08:09 UTC · model grok-4.3

classification 💻 cs.CL cs.LG

keywords Brahmic tokenizer · Indic languages · BPE tokenizer · vocabulary allocation · drop-in replacement · token compression · multilingual NLP · o200k_base

The pith

BrahmicTokenizer-131K produces 26.7% fewer Indic tokens than Tekken/Sarvam-m while matching o200k_base on English and code.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper constructs a 131072-token BPE tokenizer by pruning o200k_base to remove nine non-Indic scripts and reallocating slots to Brahmic Unicode blocks. This retrofit keeps the pre-tokenizer, decoder, and merge rules identical to the original. On large Indic datasets it produces substantially fewer tokens than competing 131K tokenizers. It matches the original's English fertility and outperforms alternatives on code and math benchmarks. A reader cares because it offers a practical way to add Indic support to models without retraining the tokenizer from scratch or losing other capabilities.

Core claim

BrahmicTokenizer-131K is constructed via a script-prune crop that reduces the vocabulary from 200019 to 131072 tokens by removing nine out-of-scope writing systems, followed by a linear-programming allocation of 2372 corpus-dead slots to nine Brahmic Unicode blocks. The pre-tokenizer, decoder, and inherited merge rules remain unchanged from o200k_base. On 27 million Indic documents it yields 26.7% fewer tokens than Mistral-Nemo Tekken or Sarvam-m, while matching o200k_base's English fertility (1.235 vs 1.232 tokens per word) and beating Tekken/Sarvam-m by 4.0-14.2% on HumanEval, MBPP, and GSM8K.
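To make the fertility and savings figures concrete, here is a minimal Python sketch of how numbers of this kind are usually computed. The helper names (`fertility`, `relative_savings`) and the whitespace word-splitting convention are assumptions of this sketch; the abstract does not state the paper's exact word-counting rule, and the token counts below are illustrative, not the paper's data.

```python
def count_words(text: str) -> int:
    """Count whitespace-delimited words (an assumed convention)."""
    return len(text.split())


def fertility(token_count: int, word_count: int) -> float:
    """Average number of tokens emitted per word."""
    if word_count <= 0:
        raise ValueError("word_count must be positive")
    return token_count / word_count


def relative_savings(baseline_tokens: int, candidate_tokens: int) -> float:
    """Fraction of tokens saved by the candidate relative to the baseline."""
    if baseline_tokens <= 0:
        raise ValueError("baseline_tokens must be positive")
    return 1.0 - candidate_tokens / baseline_tokens


# Illustrative counts only: 733 candidate tokens against 1000 baseline tokens
# corresponds to a 26.7% reduction.
example_savings = relative_savings(1000, 733)

# Relative gap between the two reported English fertilities (1.235 vs 1.232),
# expressed in percent; it comes out at roughly 0.24%.
english_gap_pct = 100 * (1.235 - 1.232) / 1.232
```

Read this way, "matching" English fertility means a gap of about a quarter of a percent, which is the quantity a replication would need to reproduce.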

What carries the argument

The two-stage retrofit consisting of a script-prune crop and linear-programming allocation of vocabulary slots to Brahmic blocks, preserving all other components of o200k_base.
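The first stage can be pictured with a toy filter. The sketch below is only an illustration of a script-based crop, not the paper's procedure: the list of out-of-scope scripts is hypothetical (the source does not name the nine it removes), and real o200k_base entries are byte sequences, whereas this sketch treats tokens as already-decoded strings.

```python
import unicodedata

# Hypothetical out-of-scope scripts, chosen only for illustration.
OUT_OF_SCOPE_PREFIXES = ("CYRILLIC", "HANGUL", "HIRAGANA")


def is_out_of_scope(token: str) -> bool:
    """True if any character's Unicode name starts with an out-of-scope script."""
    for ch in token:
        name = unicodedata.name(ch, "")  # "" for unnamed code points
        if name.startswith(OUT_OF_SCOPE_PREFIXES):
            return True
    return False


def crop_vocab(vocab: list[str]) -> list[str]:
    """Keep only tokens that contain no out-of-scope characters."""
    return [tok for tok in vocab if not is_out_of_scope(tok)]


# Toy vocabulary: Latin, Cyrillic, Hangul, Devanagari, and Hiragana entries.
toy_vocab = [
    "hello",
    " world",
    "\u043f\u0440\u0438\u0432\u0435\u0442",  # Cyrillic
    "\ud55c\uad6d",                          # Hangul
    "\u0928\u092e\u0938\u094d\u0924\u0947",  # Devanagari
    "def",
    "\u3072\u3089",                          # Hiragana
]
cropped = crop_vocab(toy_vocab)
```

In this toy run the Latin and Devanagari entries survive and the three hypothetical out-of-scope entries are dropped; the real crop would also have to renumber token IDs and filter the merge table consistently.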

If this is right

Achieves 26.7% token reduction on Indic pretraining text compared to other 131K tokenizers.
Maintains equivalent English fertility to o200k_base.
Outperforms Tekken/Sarvam-m on HumanEval, MBPP, and GSM8K by 4.0-14.2%.
Is the only tokenizer competitive across Brahmic, English, EU languages, code, and math at the 131K vocabulary size.
Specialist Indic tokenizers at other sizes show worse English and code performance.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

This approach could be generalized to retrofit tokenizers for other underrepresented scripts without full retraining.
The drop-in replacement property allows immediate use in existing model pipelines for Indic languages.
Better token efficiency on Indic text may reduce training costs for multilingual models targeting South Asian languages.
The linear programming allocation method might be adapted for other vocabulary optimization problems.

Load-bearing premise

That the removal of nine scripts and the reallocation of slots to Brahmic blocks leaves the inherited merge rules and pre-tokenizer equally effective on non-Indic content.

What would settle it

A direct comparison showing that BrahmicTokenizer-131K produces more than 1.235 tokens per word on a standard English test set or lower scores on HumanEval than o200k_base would falsify the preservation claim.

Original abstract

We present BrahmicTokenizer-131K, a 131,072-vocabulary byte-level BPE tokenizer that closes the Brahmic compression gap at the 131K-vocabulary class while preserving the English, EU-language, and code compression of OpenAI's o200k_base. We construct it through a two-stage retrofit: (1) a script-prune crop that reduces 200,019 tokens to 131,072 by removing nine out-of-scope writing systems, and (2) a surgical retrofit of 2,372 corpus-dead vocabulary slots determined by linear-programming allocation across nine Brahmic Unicode blocks. The pre-tokenizer, decoder, and inherited merge rules are unchanged from o200k_base, making BrahmicTokenizer-131K a drop-in replacement at the tokenizer interface. On 27 million documents of public Indic pretraining text (2.84 billion words, 46.21 GB), BrahmicTokenizer-131K produces 26.7% fewer tokens than Mistral-Nemo Tekken / Sarvam-m at the same vocabulary budget, with per-language savings of 15.79% (Tamil) to 76.79% (Odia, a 4.31x compression ratio). The Odia advantage is mechanistically explained by Tekken/Sarvam-m containing zero Oriya-block tokens; our surgery added 725. On non-Indic content, BrahmicTokenizer-131K matches o200k_base's English fertility (1.235 vs 1.232 tokens/word) and beats Tekken/Sarvam-m by 4.0-14.2% on HumanEval, MBPP, and GSM8K. Across our 14-tokenizer benchmark, it is the only tokenizer simultaneously competitive on Brahmic, English, EU, code, and math at the 131K budget. Specialist tokenizers at other vocab classes (Sarvam-30B, Sarvam-1, MUTANT-Indic) achieve better Indic compression at the cost of non-Indic performance: Sarvam-1's English fertility is 15.9% worse and its code/math compression 26-33% worse than ours. We release the artifact under Apache 2.0 at https://huggingface.co/theschoolofai/BrahmicTokenizer-131K.
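The abstract quotes the Odia result both as a percentage saving and as a compression ratio. The two are linked by a simple identity: if a tokenizer saves a fraction s of the baseline's tokens, the baseline emits 1 / (1 - s) times as many tokens. A short sketch, with a function name of our own choosing, confirms that the two reported figures agree:

```python
def compression_ratio(savings: float) -> float:
    """Convert fractional token savings into a baseline/candidate token ratio."""
    if not 0.0 <= savings < 1.0:
        raise ValueError("savings must be in [0, 1)")
    return 1.0 / (1.0 - savings)


# Reported per-language extremes from the abstract.
odia_ratio = compression_ratio(0.7679)   # rounds to 4.31
tamil_ratio = compression_ratio(0.1579)  # rounds to 1.19
```

So the 76.79% saving and the 4.31x ratio are the same measurement stated two ways, and the Tamil floor of 15.79% corresponds to a ratio of about 1.19x.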

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

The retrofit gives a drop-in 131K tokenizer that produces 26.7% fewer Indic tokens than Tekken/Sarvam-m while matching o200k_base on English fertility and code benchmarks.

The main takeaway is that script-prune plus linear-programming slot allocation produces a usable replacement for o200k_base that improves Brahmic compression without changing the merge rules or pre-tokenizer. The paper reports 26.7% fewer tokens than Tekken/Sarvam-m on 27M Indic documents, with Odia seeing a 4.31x compression ratio because the new tokenizer added 725 Oriya-block entries where Tekken/Sarvam-m contain none. English fertility is 1.235 tokens per word versus 1.232 for o200k_base, and the tokenizer beats Tekken/Sarvam-m by 4.0-14.2% on HumanEval, MBPP, and GSM8K.
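The "zero Oriya-block tokens" observation is the kind of claim a reader can check directly against a vocabulary file. The sketch below counts entries that touch the Unicode Oriya block (U+0B00 to U+0B7F). The membership rule (at least one character in the block) is an assumption; the paper may count only tokens that lie entirely inside the block, and the toy vocabulary is invented for illustration.

```python
# Unicode Oriya block: U+0B00 through U+0B7F inclusive.
ORIYA_BLOCK = range(0x0B00, 0x0B80)


def in_block(token: str, block: range) -> bool:
    """True if the token contains at least one code point in the block."""
    return any(ord(ch) in block for ch in token)


def count_block_tokens(vocab: list[str], block: range) -> int:
    """Number of vocabulary entries that touch the given Unicode block."""
    return sum(1 for tok in vocab if in_block(tok, block))


# Toy vocabulary: two Odia entries, one Devanagari entry, two Latin entries.
toy_vocab = [
    "\u0b13\u0b21\u0b3c\u0b3f\u0b06",        # Odia
    "\u0b2d\u0b3e\u0b37\u0b3e",              # Odia
    "hello",
    "\u0939\u093f\u0928\u094d\u0926\u0940",  # Devanagari
    " the",
]
oriya_count = count_block_tokens(toy_vocab, ORIYA_BLOCK)
```

A tokenizer whose count is zero must fall back to raw bytes for every Odia character, which is consistent with the large per-language gap the paper reports.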

What stands out is the explicit two-stage procedure at fixed vocabulary size. Pruning nine out-of-scope scripts shrinks the vocabulary from 200,019 to 131,072 entries, then LP reassigns 2,372 corpus-dead slots to nine Brahmic blocks. Keeping the inherited merges and decoder makes the drop-in claim testable, and the direct fertility measurement on English text addresses the obvious worry that pruning would force extra splits downstream.

The soft spots are the usual ones for this kind of engineering note. The abstract gives no error bars, no description of the LP objective or constraints, and no ablation of the prune step itself. The 27M-document corpus is described only at high level, so it is hard to judge how much the savings depend on the particular sample. Those gaps are real but not load-bearing for the central claim, since the English match and benchmark wins are reported as direct comparisons.

The stress-test concern about merge paths on non-Indic text does not appear to land here; the measured fertility numbers already test that the pruned set is effectively inert for the reported English data. This paper is for people who need better Indic coverage inside an existing tokenizer budget and are willing to accept an engineering retrofit rather than a full redesign. A reader building multilingual pretraining pipelines would find the released artifact and the per-language breakdowns useful. It deserves a serious referee because the construction is reproducible from the description and the comparisons use public data.

Referee Report

3 major / 1 minor

Summary. The paper claims to construct BrahmicTokenizer-131K, a 131072-vocab byte-level BPE drop-in replacement for o200k_base, via a two-stage process of script-pruning nine out-of-scope writing systems followed by linear-programming allocation of 2372 corpus-dead slots to nine Brahmic Unicode blocks. It reports 26.7% fewer tokens than Mistral-Nemo Tekken/Sarvam-m on 27M Indic documents (2.84B words) while matching o200k_base English fertility (1.235 vs 1.232 tokens/word) and outperforming alternatives on HumanEval, MBPP, and GSM8K; it positions itself as the only 131K tokenizer competitive across Brahmic, English, EU, code, and math.

Significance. If the central claims hold, the work supplies a practical, openly released (Apache 2.0) tokenizer that narrows the Indic compression gap at fixed vocabulary budget without measurable degradation on English/EU/code/math, supported by large-scale empirical comparisons across 14 tokenizers. The explicit retrofit procedure and artifact release are concrete strengths that aid reproducibility in multilingual pretraining.

major comments (3)

[Abstract / Methods] Abstract and Methods: the linear-programming allocation of exactly 2372 slots across nine Brahmic blocks is described only at the level of 'determined by linear-programming allocation' with no statement of the objective function, constraints, or solver; this is load-bearing for the retrofit step that underpins both the Indic gains and the claim of unchanged non-Indic behavior.
[Results] Results: the headline 26.7% token reduction (and per-language figures 15.79%–76.79%) on the 27M-document corpus supplies neither error bars, variance estimates, nor any description of sampling or deduplication; without these the robustness of the central compression claim cannot be assessed.
[Methods] Methods: the claim that 'the pre-tokenizer, decoder, and inherited merge rules are unchanged' and therefore non-Indic effectiveness is preserved is not accompanied by any verification that the pruned tokens are merge-inert on English/EU/code text or that the new Brahmic entries do not alter existing merge paths on mixed-script data; this directly bears on the drop-in replacement guarantee.

minor comments (1)

[Abstract] Abstract: the statement that BrahmicTokenizer-131K 'beats alternatives on HumanEval, MBPP, and GSM8K' would be clearer if the exact percentage improvements and the identity of the 'alternatives' were stated in the same sentence.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment point by point below. Revisions will be incorporated to improve the manuscript's clarity, reproducibility, and empirical robustness.

Referee: [Abstract / Methods] Abstract and Methods: the linear-programming allocation of exactly 2372 slots across nine Brahmic blocks is described only at the level of 'determined by linear-programming allocation' with no statement of the objective function, constraints, or solver; this is load-bearing for the retrofit step that underpins both the Indic gains and the claim of unchanged non-Indic behavior.

Authors: We agree that the linear-programming details require explicit specification for reproducibility. In the revised manuscript we will add a dedicated Methods subsection stating the objective (minimize expected fertility on a held-out Indic validation corpus), the constraints (non-negative integers summing exactly to 2372, with per-block upper bounds derived from character-frequency statistics), and the solver (PuLP with CBC). This directly addresses the load-bearing nature of the allocation step. revision: yes
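To give a feel for a budgeted allocation of this shape, here is a deliberately simplified Python sketch. It is not an LP and not the authors' formulation: it swaps in a greedy marginal-gain rule with a hypothetical diminishing-returns model (the k-th slot given to a block is worth weight / k), and it uses three toy blocks with made-up weights and upper bounds rather than the nine Brahmic blocks. Only the budget of 2372 and the constraint shape (non-negative integers, fixed sum, per-block caps) come from the text above.

```python
import heapq


def allocate_slots(
    budget: int,
    weights: dict[str, float],
    upper_bounds: dict[str, int],
) -> dict[str, int]:
    """Assign slots one at a time to the block with the largest marginal gain.

    Assumed gain model: the (n + 1)-th slot for block b is worth
    weights[b] / (n + 1). Blocks stop receiving slots at their upper bound.
    """
    alloc = {block: 0 for block in weights}
    # Max-heap via negated gains; the first slot of block b is worth weights[b].
    heap = [(-w, block) for block, w in weights.items() if upper_bounds[block] > 0]
    heapq.heapify(heap)
    remaining = budget
    while remaining > 0 and heap:
        _, block = heapq.heappop(heap)
        alloc[block] += 1
        remaining -= 1
        if alloc[block] < upper_bounds[block]:
            next_gain = weights[block] / (alloc[block] + 1)
            heapq.heappush(heap, (-next_gain, block))
    return alloc


# Hypothetical inputs for illustration only.
toy_weights = {"Devanagari": 5.0, "Oriya": 3.0, "Tamil": 2.0}
toy_bounds = {"Devanagari": 1000, "Oriya": 2000, "Tamil": 2000}
toy_alloc = allocate_slots(2372, toy_weights, toy_bounds)
```

With a separable, concave gain model like this one, the greedy rule reaches the same optimum an integer program would; a real objective defined on corpus fertility is not separable in this way, which is why a solver such as the one named in the rebuttal would be needed.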
Referee: [Results] Results: the headline 26.7% token reduction (and per-language figures 15.79%–76.79%) on the 27M-document corpus supplies neither error bars, variance estimates, nor any description of sampling or deduplication; without these the robustness of the central compression claim cannot be assessed.

Authors: We accept that statistical robustness measures are currently missing. The 27M-document corpus was obtained by sampling public Indic web data followed by MinHash deduplication (Jaccard threshold 0.8). In revision we will add the sampling and deduplication description plus bootstrap standard errors (500 resamples) for the overall and per-language token-reduction figures. revision: yes
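A document-level bootstrap of the kind described could look like the following sketch. The function names, the synthetic per-document token counts, and the choice to resample whole documents are assumptions made for illustration; only the figure of 500 resamples comes from the response above.

```python
import random
import statistics


def token_reduction(pairs: list[tuple[int, int]]) -> float:
    """Corpus-level fractional reduction from (baseline, candidate) token counts."""
    baseline = sum(b for b, _ in pairs)
    candidate = sum(c for _, c in pairs)
    return 1.0 - candidate / baseline


def bootstrap_se(
    pairs: list[tuple[int, int]], n_resamples: int = 500, seed: int = 0
) -> float:
    """Standard error of the reduction, resampling documents with replacement."""
    rng = random.Random(seed)
    n = len(pairs)
    estimates = []
    for _ in range(n_resamples):
        sample = [pairs[rng.randrange(n)] for _ in range(n)]
        estimates.append(token_reduction(sample))
    return statistics.stdev(estimates)


# Synthetic per-document counts (not the paper's data): each document's
# candidate count is 60-85% of its baseline count.
_rng = random.Random(1)
toy_docs = []
for _ in range(200):
    base = _rng.randint(200, 2000)
    toy_docs.append((base, round(base * _rng.uniform(0.60, 0.85))))

point_estimate = token_reduction(toy_docs)
standard_error = bootstrap_se(toy_docs)
```

Resampling at the document level keeps long and short documents weighted as they are in the corpus statistic; per-language errors would repeat the same procedure within each language's subset.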
Referee: [Methods] Methods: the claim that 'the pre-tokenizer, decoder, and inherited merge rules are unchanged' and therefore non-Indic effectiveness is preserved is not accompanied by any verification that the pruned tokens are merge-inert on English/EU/code text or that the new Brahmic entries do not alter existing merge paths on mixed-script data; this directly bears on the drop-in replacement guarantee.

Authors: The drop-in property follows from the unchanged pre-tokenizer and merge table together with the fact that new tokens occupy previously zero-frequency IDs. Nevertheless, to provide the requested empirical verification we will add an appendix containing tokenization comparisons on English-only, code, and mixed-script corpora demonstrating identical output sequences for non-Brahmic content. This will be included in the revision. revision: yes
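The inertness argument can be illustrated on a toy BPE encoder. The sketch below is a character-level simplification with an invented four-rule merge table, not o200k_base: it replaces one merge over Cyrillic characters with a merge over Odia characters at the same rank, then checks that an English word encodes identically before and after.

```python
INF = float("inf")


def bpe_encode(word: str, ranks: dict[tuple[str, str], int]) -> list[str]:
    """Repeatedly merge the adjacent pair with the lowest rank (leftmost on ties)."""
    parts = list(word)
    while len(parts) > 1:
        candidates = [
            (ranks.get((parts[i], parts[i + 1]), INF), i)
            for i in range(len(parts) - 1)
        ]
        best_rank, i = min(candidates)
        if best_rank == INF:
            break  # no applicable merge rule is left
        parts[i : i + 2] = [parts[i] + parts[i + 1]]
    return parts


def to_ranks(merges: list[tuple[str, str]]) -> dict[tuple[str, str], int]:
    """Map each merge rule to its priority (lower rank merges first)."""
    return {pair: rank for rank, pair in enumerate(merges)}


# Invented merge tables: the retrofit reuses the slot of a Cyrillic merge.
base_merges = [("l", "o"), ("lo", "w"), ("\u043f", "\u0440"), ("e", "r")]
retro_merges = [("l", "o"), ("lo", "w"), ("\u0b13", "\u0b21"), ("e", "r")]
base_ranks = to_ranks(base_merges)
retro_ranks = to_ranks(retro_merges)

english_before = bpe_encode("lower", base_ranks)
english_after = bpe_encode("lower", retro_ranks)
odia_before = bpe_encode("\u0b13\u0b21", base_ranks)
odia_after = bpe_encode("\u0b13\u0b21", retro_ranks)
```

In the toy case the English word yields the same pieces under both tables while the Odia pair collapses from two pieces to one. The appendix the authors promise would need to show the same thing at scale, including mixed-script inputs where a replaced rule's symbols could in principle appear.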

Circularity Check

0 steps flagged

No circularity; explicit construction validated by external empirical benchmarks

The paper presents a deterministic two-stage retrofit procedure (script-prune crop followed by linear-programming slot allocation) whose outputs are then measured directly against independent tokenizers on public Indic and non-Indic corpora. Fertility ratios, token counts, and benchmark scores are computed from raw text, not derived from any fitted parameter that is later renamed as a prediction. No self-citations appear in the load-bearing claims, and the unchanged merge rules are asserted as an implementation fact rather than a derived result. All reported advantages are falsifiable comparisons external to the construction itself.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axiom · 0 invented entities

The work rests on the standard byte-level BPE algorithm and the pre-existing o200k_base merge rules and pre-tokenizer as fixed inputs. The retrofit numbers (2,372 slots, nine Brahmic blocks) are the only quantities introduced by the paper itself.

free parameters (2)

Target vocabulary size = 131072
Fixed at 131072 to enable direct comparison inside the 131K class.
Number of retrofitted slots = 2372
Set by linear-programming allocation to replace corpus-dead entries with Brahmic tokens.

axioms (1)

domain assumption: The inherited merge rules from o200k_base remain optimal after the script-prune and slot retrofit.
Invoked when the paper states that pre-tokenizer, decoder, and merge rules are unchanged.

pith-pipeline@v0.9.1-grok · 5967 in / 1641 out tokens · 49128 ms · 2026-06-29T08:09:36.853239+00:00 · methodology

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Reversible Foundations: Training a 120B Sparse MoE through State-Preserving Scaling
cs.LG 2026-06 unverdicted novelty 4.0

A 120B sparse MoE model with 460 experts was trained on one 8-GPU node to loss 1.78 using reversible recurrence and state-preserving scaling from a 1.78B dense seed, with 5.93B active parameters.

Reference graph

Works this paper leans on

26 extracted references · 22 canonical work pages · cited by 1 Pith paper · 9 internal anchors

